Description: This dataset (Version 10) contains a collection of research papers along with various attributes and metadata. It is a comprehensive and diverse dataset that can be used for a wide range of research and analysis tasks. The dataset encompasses papers from different fields of study, including computer science, mathematics, physics, and more.
Fields in the Dataset:
- id: A unique identifier for each paper.
- title: The title of the research paper.
- authors: The list of authors involved in the paper.
- venue: The journal or venue where the paper was published.
- year: The year when the paper was published.
- n_citation: The number of citations received by the paper.
- references: A list of paper IDs that are cited by the current paper.
- abstract: The abstract of the paper.
Example:
- "id": "013ea675-bb58-42f8-a423-f5534546b2b1"
- "title": "Prediction of consensus binding mode geometries for related chemical series of positive allosteric modulators of adenosine and muscarinic acetylcholine receptors"
- "authors": ["Leon A. Sakkal", "Kyle Z. Rajkowski", "Roger S. Armen"]
- "venue": "Journal of Computational Chemistry"
- "year": 2017
- "n_citation": 0
- "references": ["4f4f200c-0764-4fef-9718-b8bccf303dba", "aa699fbf-fabe-40e4-bd68-46eaf333f7b1"]
- "abstract": "This paper studies ..."
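As a quick illustration of working with records of this shape, here is a minimal Python sketch; the file name papers.jsonl and the one-JSON-object-per-line layout are assumptions made for the example, not the dataset's documented distribution format:

import json

# Hypothetical file name and layout; adjust to the actual distribution format.
with open("papers.jsonl", "r", encoding="utf-8") as fh:
    papers = [json.loads(line) for line in fh]

# Index papers by their unique id.
by_id = {p["id"]: p for p in papers}

# Count how often each paper is cited by other papers within this collection.
ref_counts = {}
for p in papers:
    for ref_id in p.get("references", []):
        ref_counts[ref_id] = ref_counts.get(ref_id, 0) + 1

# Compare the stored n_citation value with citations found inside the collection.
for pid, paper in by_id.items():
    print(paper["title"], paper.get("n_citation"), ref_counts.get(pid, 0))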
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal that aims to enhance data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information on whether there is festival run information available through the IMDb data.
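A minimal pandas sketch for loading the two film tables is given below; the .csv file extensions and the film ID column name (film_id) are assumptions for illustration, while fest is the documented sample festival variable:

import pandas as pd

# File names as listed above; "film_id" is a hypothetical ID column name.
films_long = pd.read_csv("1_film-dataset_festival-program_long.csv")
films_wide = pd.read_csv("1_film-dataset_festival-program_wide.csv")

print(len(films_wide))  # expected to be 9,348 unique films

# In the long table a film appears once per sample festival it was drawn from.
appearances = films_long.groupby("film_id").size()
print((appearances > 1).sum(), "films appeared in more than one sample festival")

# In the wide table, "fest" records the first sample festival of each film.
print(films_wide["fest"].value_counts())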
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is in the wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods: “cosine” and “osa”, where cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
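The matching itself is implemented in the R scripts; purely to illustrate the idea of combining a cosine similarity on titles with an edit-distance-style comparison that tolerates typos, here is a small Python sketch (the thresholds, helper names, and the use of SequenceMatcher as a stand-in for OSA are illustrative assumptions, not taken from the scripts):

from collections import Counter
from difflib import SequenceMatcher
import math

def trigrams(s):
    s = s.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine_similarity(a, b):
    # Cosine similarity over character trigram counts.
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def titles_match(core_title, imdb_title, cos_threshold=0.8, ratio_threshold=0.85):
    # Accept a high trigram-cosine similarity, or a high edit-style ratio that
    # tolerates typos and minor variations (a stand-in for the OSA distance).
    if cosine_similarity(core_title, imdb_title) >= cos_threshold:
        return True
    return SequenceMatcher(None, core_title.lower(), imdb_title.lower()).ratio() >= ratio_threshold

print(titles_match("The Shape of Water", "Shape of Water, The"))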
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the function defined in “r_4_scraping_functions” in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films to check if everything works. Scraping the entire dataset took a few hours; therefore, a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables (such as location, festival name, and festival categories), units of measurement, data sources, coding, and information on missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row.
The data provided are synthetic hourly electricity load profiles for the paper and food industries for one year. The data have been synthesized from two years of measured data from industries in Chile using a comprehensive clustering analysis. The synthetic data possess the same statistical characteristics as the measured data but are provided normalized to one kW and anonymized in order to be used without confidentiality issues. Three CSV files are provided: food_i.csv, paper_i_small.csv and paper_i_large.csv, containing the data of a small food processing industry, a small paper industry, and a medium-large paper industry, respectively. All three files contain seven columns of data: weekday, month, hour, cluster, min, max, mean. The first four columns index the data in the following way:
Month: it includes the range of integer values between 1 and 12 accounting for the consecutive calendar months of a year starting in January (1) and ending in December (12).
Weekday: this column has integer values in the range 1 to 7 that are equivalent to the consecutive days of the week starting on Monday (1) and ending on Sunday (7).
Hour: it consists of integer values ranging between 1 and 24, which describe the hours of a day.
Cluster: The column “cluster” represents the cluster to which this data point is associated. The number of clusters is different for each load profile, as is the number of days included in each cluster. Since the clusters were calculated for whole days, a cluster number covers 24 consecutive data points.
The load profile data are provided in the three different columns: min, max and mean:
Min: this column provides the min value of the cluster at that time of the day. Therefore, it represents the minimum demand of electricity recorded in all the days belonging to this representative group of data.
Max: This column provides the maximum electric load of the cluster at that time of the day. It represents the maximum demand for electricity in all the days belonging to this representative group of data at that hour of the day.
Mean: This column provides the average electric load of the cluster at that time of the day. It represents the mean demand for electricity belonging to this representative group of data at that hour of the day.
The min, max and mean values are different for each hour of the day. All values range from 0 to 1 and are given in kW, normalized to one kW as described above.
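A minimal pandas sketch for reading one of these files and extracting a daily profile; the file name comes from the list above and only the documented columns are used:

import pandas as pd

# Small food processing industry profile (columns: weekday, month, hour, cluster, min, max, mean).
food = pd.read_csv("food_i.csv")

# Normalized load profile for Mondays (weekday == 1) in January (month == 1), ordered by hour.
profile = (food[(food["month"] == 1) & (food["weekday"] == 1)]
           .sort_values("hour")[["hour", "cluster", "min", "mean", "max"]])
print(profile.head(24))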
For details on the clustering procedure or the data itself please refer to the associated paper published in the journal Energy and the one published in Data in Brief journal.
The study was supported by the German Federal Ministry of Education and Research - BMBF and the Chilean National Commission for Scientific Research and Technology - CONICYT (grant number BMBF150075), the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH through the Energy Program in Chile, and the European Research Council (“reFUEL” ERC-2017-STG 758149).
The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate the response to a data management expert in their unit, to all members of their unit, or to collate responses from their unit themselves before reporting in the survey. Larger storage ranges cover vastly different amounts of data, so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per-person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.
Resources in this dataset:
Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdf. Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop down not shown here. Resource Software Recommended: Adobe Acrobat, URL: https://get.adobe.com/reader/
Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csv. Resource Description: CSV file includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).
Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsx. Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. Resource Software Recommended: Microsoft Excel, URL: https://products.office.com/en-us/excel
This dataset provides geospatial location data and scripts used to analyze the relationship between MODIS-derived NDVI and solar and sensor angles in a pinyon-juniper ecosystem in Grand Canyon National Park. The data are provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and scripts allow users to replicate, test, or further explore results. The file GrcaScpnModisCellCenters.csv contains locations (latitude-longitude) of all the 250-m MODIS (MOD09GQ) cell centers associated with the Grand Canyon pinyon-juniper ecosystem that the Southern Colorado Plateau Network (SCPN) is monitoring through its land surface phenology and integrated upland monitoring programs. The file SolarSensorAngles.csv contains MODIS angle measurements for the pixel at the phenocam location plus a random 100 point subset of pixels within the GRCA-PJ ecosystem. The script files (folder: 'Code') consist of 1) a Google Earth Engine (GEE) script used to download MODIS data through the GEE JavaScript interface, and 2) a script used to calculate derived variables and to test relationships between solar and sensor angles and NDVI using the statistical software package 'R'. The file Fig_8_NdviSolarSensor.JPG shows NDVI dependence on solar and sensor geometry demonstrated for both a single pixel/year and for multiple pixels over time. (Left) MODIS NDVI versus solar-to-sensor angle for the Grand Canyon phenocam location in 2018, the year for which there is corresponding phenocam data. (Right) Modeled r-squared values by year for 100 randomly selected MODIS pixels in the SCPN-monitored Grand Canyon pinyon-juniper ecosystem. The model for forward-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle. The model for back-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle + sensor zenith angle. Boxplots show interquartile ranges; whiskers extend to 10th and 90th percentiles. The horizontal line marking the average median value for forward-scatter r-squared (0.835) is nearly indistinguishable from the back-scatter line (0.833). The dataset folder also includes supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (e.g., the folders .Rproj.user and packrat, and the files .RData and PhenocamPR.Rproj). The empty folder GEE_DataAngles is included so that the user can save the data files from the Google Earth Engine scripts to this location, where they can then be incorporated into the R processing scripts without needing to change folder names. To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may also use the descriptive documentation, the phenopix package documentation, and the description/references provided in the associated journal article to process the data to achieve the same results using newer packages or other software programs.
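For readers who want a feel for the two model forms named above without running the provided R workflow, here is a minimal Python sketch; the column names (NDVI, solar_to_sensor, sensor_zenith) and the toy values are illustrative assumptions, not the study's data or scripts:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data frame with one row per MODIS observation (values are made up).
df = pd.DataFrame({
    "NDVI": [0.31, 0.28, 0.35, 0.30, 0.27, 0.33],
    "solar_to_sensor": [25.0, 60.0, 15.0, 40.0, 70.0, 20.0],
    "sensor_zenith": [10.0, 45.0, 5.0, 30.0, 50.0, 12.0],
})

# Forward-scatter model: log(NDVI) ~ solar-to-sensor angle.
fwd = smf.ols("np.log(NDVI) ~ solar_to_sensor", data=df).fit()
# Back-scatter model: log(NDVI) ~ solar-to-sensor angle + sensor zenith angle.
back = smf.ols("np.log(NDVI) ~ solar_to_sensor + sensor_zenith", data=df).fit()
print(fwd.rsquared, back.rsquared)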
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Quantity of citations for Covid-19 and non-Covid-19 articles with median and range values across all fields.
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
In order to obtain access to the full dataset (in the CSV format), please request access by following the instructions provided below.
Note: Please also check our MultiClaim Dataset, which provides a more recent, larger, and highly multilingual dataset of fact-checked claims, social media posts and relations between them.
References
If you use this dataset in any publication, project, tool or in any other form, please, cite the following papers:
@inproceedings{SrbaMonantPlatform,
author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
pages = {1--7},
title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
year = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
numpages = {11},
title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
year = {2022},
doi = {10.1145/3477495.3531726},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531726},
}
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (from RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset
The way to report considerable mistakes in raw collected data or in manual annotations is by creating a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as they appear at the original websites). Raw data consist of articles from news sites and blogs (e.g. naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g. snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files:
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Secondly, the dataset contains so-called annotations. Entity annotations describe the individual raw data entities (e.g., article, source). Relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
At the same time, annotations are associated with a particular object identified by:
The dataset provides specifically these entity
Data protocol and datasets used for the study entitled 'A value creation model from science-society interconnections: Components and archetypes'. Abstract of the paper: The interplay between science and society takes place through a wide range of intertwined relationships and mutual influences that shape each other and facilitate continuous knowledge flows. Stylised consequentialist perspectives on valuable knowledge moving from public science to society in linear and recursive pathways, whilst informative, cannot fully capture the broad spectrum of value creation possibilities. As an alternative we experiment with an approach that gathers together diverse science-society interconnections and reciprocal research-related knowledge processes that can generate valorisation. Our approach to value creation attempts to incorporate multiple facets, directions and dynamics in which constellations of scientific and societal actors generate value from research. The paper develops a conceptual model based on a set of nine value components derived from four key research-related knowledge processes: production, translation, communication, and utilization. The paper conducts an exploratory empirical study to investigate whether a set of archetypes can be discerned among these components that structure science-society interconnections. We explore how such archetypes vary between major scientific fields. Each archetype is overlaid on a research topic map, with our results showing that different archetypes correspond to distinctive topic areas. The paper finishes by discussing the significance and limitations of our results and the potential of both our model and our empirical approach for further research.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Summary
A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
30 completely labeled (segmented) images
71 partly labeled images
altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)
To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
A set of metrics and a novel ranking score for respective meaningful method benchmarking
An evaluation of three baseline methods in terms of the above metrics and score
Abstract
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
Dataset documentation:
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
FISBe Datasheet
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Files
fisbe_v1.0_{completely,partly}.zip
contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
fisbe_v1.0_mips.zip
maximum intensity projections of all samples, for convenience.
sample_list_per_split.txt
a simple list of all samples and the subset they are in, for convenience.
view_data.py
a simple python script to visualize samples, see below for more information on how to use it.
dim_neurons_val_and_test_sets.json
a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
Readme.md
general information
How to work with the image files
Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
How to open zarr files
Install the python zarr package:
pip install zarr
Open a zarr file with:
import zarr
raw = zarr.open(<path_to_zarr_file>, mode='r', path="volumes/raw")
seg = zarr.open(<path_to_zarr_file>, mode='r', path="volumes/gt_instances")
Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also explicitly be converted to numpy arrays.
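For example, a minimal way to materialize one of the arrays opened above as a numpy array (raw refers to the handle from the snippet above):
import numpy as np
raw_np = np.asarray(raw)  # equivalently raw[:]; loads the full array into memory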
How to view zarr image files
We recommend using napari to view the image data.
Install napari:
pip install "napari[all]"
Save the following Python script:
import zarr, sys, napari
raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
Execute:
python view_data.py /R9F03-20181030_62_B5.zarr
Metrics
S: Average of avF1 and C
avF1: Average F1 Score
C: Average ground truth coverage
clDice_TP: Average true positives clDice
FS: Number of false splits
FM: Number of false merges
tp: Relative number of true positives
For more information on our selected metrics and formal definitions please see our paper.
Baseline
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results please see our paper.
License
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Citation
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
  title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
  author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
  year = 2024,
  eprint = {2404.00130},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV}
}
Acknowledgments
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.
Changelog
There have been no changes to the dataset so far. All future changes will be listed on the changelog page.
Contributing
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.
All contributions are welcome!
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SBP, systolic blood pressure; pH, the hydrogen ion concentration. Categorical data are expressed as number of patients, and continuous data are expressed as mean value (range of the data). * p value < 0.05.
As per our latest research, the global satellite data services market size reached USD 8.7 billion in 2024, driven by increasing demand for real-time geospatial intelligence and advanced analytics across multiple industries. The market is poised for robust expansion, registering a CAGR of 18.2% from 2025 to 2033. By 2033, the satellite data services market is forecasted to attain a value of USD 44.1 billion, propelled by technological advancements, the proliferation of small satellite constellations, and growing integration of satellite data into commercial applications. This growth trajectory underscores the transformative impact of satellite data on decision-making processes and operational efficiency across global sectors.
One of the principal growth factors for the satellite data services market is the surge in demand for high-resolution imagery and geospatial analytics across sectors such as agriculture, energy, defense, and environmental monitoring. The rapid digitization of industries and the need for precise, real-time data to support critical operations have fueled investments in satellite data services. Additionally, the increasing frequency of natural disasters and the growing importance of climate change monitoring have necessitated the use of satellite-based solutions for timely and accurate information. The integration of artificial intelligence and machine learning with satellite data analytics has further amplified the value proposition of these services, enabling predictive insights and automated anomaly detection for enhanced decision-making.
Another significant driver is the expansion of small satellite constellations and the decreasing cost of satellite launches, which have democratized access to satellite data. The advent of low Earth orbit (LEO) satellites has revolutionized data acquisition, offering improved revisit rates and cost-effective solutions for commercial and governmental clients. The proliferation of private players and public-private partnerships has accelerated innovation in satellite data services, resulting in enhanced data quality, faster delivery times, and a wider range of value-added services. This democratization has opened new avenues for start-ups and SMEs, fostering a competitive environment that stimulates continuous technological advancement and market expansion.
The satellite data services market is also benefiting from increased government initiatives and policy support for space-based infrastructure and data utilization. Governments worldwide are investing in satellite programs to bolster national security, disaster management, and socio-economic development. These initiatives have led to greater collaboration between governmental agencies and private enterprises, promoting the adoption of satellite data for urban planning, resource management, and infrastructure development. Moreover, international efforts to standardize satellite data formats and improve interoperability are facilitating cross-border data sharing, thereby expanding the global reach and utility of satellite data services.
In the rapidly evolving landscape of satellite data services, Space Data Replay and Reprocessing Services are emerging as crucial components for enhancing data utility and accessibility. These services facilitate the retrieval and reprocessing of archived satellite data, enabling users to extract new insights and revisit past events with enhanced analytical capabilities. By leveraging advanced algorithms and cloud-based platforms, Space Data Replay and Reprocessing Services allow for the refinement of historical data, providing a more comprehensive understanding of temporal changes and trends. This capability is particularly valuable for sectors such as environmental monitoring and disaster management, where historical data can inform future strategies and improve response times. As the demand for historical data analysis grows, these services are becoming integral to maximizing the value of satellite data investments.
Regionally, North America remains the largest market for satellite data services, accounting for over 37% of global revenue in 2024, driven by the presence of leading satellite operators, advanced technological infrastructure, and substantial government funding.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results data and figures for the journal paper.Dataset includes compressed Python Pickle files containing Dictionaries of NumPy arrays and metadata for each figure. This contains input and output data. Also includes image files for each figure are also included in PNG, SVG, and PDF.Journal paper introduces RUVPY, a Python software library which implements the Relative Utility Value (RUV) method. This is available at https://github.com/richardlaugesen/ruvpy and can now be used by researchers and industry to quantify the value of forecast for decision making (pip install ruvpy).ReferencesLaugesen, R., Thyer, M., McInerney, D., & Kavetski, D. (2025), Software Library to Quantify the Value of Forecasts for Decision-Making: Case Study on Sensitivity to Damages. Environmental Modelling and Software. https://doi.org/10.1016/j.envsoft.2025.106697Laugesen, R., Thyer, M., McInerney, D., & Kavetski, D. (2023). Flexible forecast value metric suitable for a wide range of decisions: application using probabilistic subseasonal streamflow forecasts. Hydrology and Earth System Sciences, 27(4), 873-893. https://doi.org/10.5194/hess-27-873-2023Laugesen, R. (2025). RUVPY software library to quantify the value of forecasts for decision-making using RUV (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.15825583
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The USDA Forest Service (USFS) builds two versions of percent tree canopy cover (TCC) data in order to serve the needs of multiple user communities. These datasets encompass the conterminous United States (CONUS), Coastal Alaska, Hawaii, and Puerto Rico and U.S. Virgin Islands (PRUSVI). The two versions of data within the v2023-5 TCC product suite include the initial model outputs, referred to as the Science data, and a modified version built for the National Land Cover Database, referred to as the NLCD data. The NLCD product suite includes data for years 1985 through 2023. The NLCD data are processed to mask TCC from non-treed features such as water and non-tree crops, and to reduce interannual noise and smooth the NLCD time series. TCC pixel values range from 0 to 100 percent. The non-processing area is represented by the value 254, and the background is represented by the value 255. The Science and NLCD tree canopy cover data are accessible for multiple user communities, through multiple channels and platforms. For information on the Science data and processing steps see the Science metadata. Information on the NLCD data and processing steps is included here. Data Download and Methods Documents: - https://data.fs.usda.gov/geodata/rastergateway/treecanopycover/ This record was taken from the USDA Enterprise Data Inventory that feeds into the https://data.gov catalog. Data for this record includes the following resources: ISO-19139 metadata, ArcGIS Hub Dataset, ArcGIS GeoService. For complete information, please visit https://data.gov.
https://www.datainsightsmarket.com/privacy-policy
The global data science services market is projected to experience significant growth, reaching a value of 73060 million by 2033, expanding at a CAGR of 18.2% from 2025 to 2033. The surge in data generation, the increasing adoption of artificial intelligence (AI) and machine learning (ML), and the growing need for data-driven decision-making in various industries are major factors driving market growth. Additionally, the increasing demand for cloud-based data science services and the rise of data science-as-a-service (DSaaS) offerings are further contributing to market expansion. Key market trends include the increasing adoption of data science services by small and medium-sized enterprises (SMEs) and the growing demand for data scientists with specialized skills. The market is segmented into different applications and types, with data collection and data cleaning being the most prominent segments. North America holds a dominant share of the market, followed by Europe and Asia Pacific. Key players in the market include EY, Deloitte, KPMG, McKinsey & Company, and Boston Consulting Group, among others. These companies offer a range of data science services, including data analytics, data visualization, and predictive modeling. The market is expected to face challenges such as data privacy and security concerns, as well as the shortage of qualified data science professionals. However, ongoing advancements in technology, the growing adoption of AI and ML, and the increasing awareness of the benefits of data science services are expected to drive continued growth in the market.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Journal level disaggregated data of the review, acceptance and publication dates of a sample of 21890 articles from 326 Ibero-American scientific journals from all subject areas and countries included in the Latindex Catalogue 2.0 and published between 2018 and 2020. The variables included are: journal identifier; ISSN; journal title; identifier of the country/region of the journal; literal of the country/region of the journal; subject area identifier; subject area literal; journal periodicity identifier; journal periodicity; number of articles of each journal included in the study; average, median, minimum, maximum, range, and standard deviation of evaluation days; average, median, minimum, maximum, range, and standard deviation of publication days; average of total days calculated in weeks; median of total days calculated in weeks; number of weeks of the total process (from reception to publication) declared by the publishers at DOAJ; difference between total days calculated in weeks and data declared by the publishers at DOAJ; indication of whether a journal maintains a paper version; Compound Index of Secondary Diffusion (ICDS).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The massive amount of vehicle plate data generated by intelligent transportation systems is widely used in the field of urban transportation information system construction and has a high scientific research and application value. The adoption of big data platforms to properly preserve, process, and exploit these valuable data resources has become a hot research area in recent years. To address the problems of implementing complex multi-conditional comprehensive query functions and flexible data applications in the key–value database storage environment of a big data platform, this paper proposes a data access model based on the jump hash consistency algorithm. Algorithms such as data slice storage and multi-threaded sliding window parallel reading are used to realize evenly distributed storage and fast reading of massive time-series data on clustered data nodes. A comparative analysis of data distribution uniformity and retrieval efficiency shows that the model can effectively avoid generating the cluster hotspot problem, support comprehensive analysis queries with various complex conditions, and maintain high query efficiency by precisely positioning the data storage range and utilizing parallel scan reading.
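The abstract above does not reproduce the algorithm itself; as background, the following is a minimal Python sketch of the standard jump consistent hash (Lamping and Veach, 2014), which maps a record key evenly onto a fixed number of data nodes and may differ in detail from the paper's variant; the example key is hypothetical:

def jump_consistent_hash(key: int, num_buckets: int) -> int:
    # Map a 64-bit key to a bucket in [0, num_buckets) with an even distribution.
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step; the mask keeps the value at 64 bits.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

# Example: route a (hypothetical) plate-record key to one of 8 storage nodes.
key = hash("plate:ABC1234|2021-06-01T08") & 0xFFFFFFFFFFFFFFFF
print(jump_consistent_hash(key, 8))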
This dataset is a collection of marine environmental data layers suitable for use in Southern Ocean species distribution modelling. All environmental layers have been generated at a spatial resolution of 0.1 degrees, covering the Southern Ocean extent (80 degrees S - 45 degrees S, -180 - 180 degrees). The layers include information relating to bathymetry, sea ice, ocean currents, primary production, particulate organic carbon, and other oceanographic data.
An example of reading and using these data layers in R can be found at https://australianantarcticdivision.github.io/blueant/articles/SO_SDM_data.html.
The following layers are provided:
Source: This study. Derived from GEBCO URL: https://www.gebco.net/data_and_products/gridded_bathymetry_data/ Citation: Fabri-Ruiz S, Saucede T, Danis B and David B (2017). Southern Ocean Echinoids database_An updated version of Antarctic, Sub-Antarctic and cold temperate echinoid database. ZooKeys, (697), 1.
Layer name: geomorphology Description: Last update on biodiversity.aq portal. Derived from O'Brien et al. (2009) seafloor geomorphic feature dataset. Mapping based on GEBCO contours, ETOPO2, and seismic lines. Value range: 27 categories Units: categorical Source: This study. Derived from Australian Antarctic Data Centre URL: https://data.aad.gov.au/metadata/records/Polar_Environmental_Data Citation: O'Brien, P.E., Post, A.L., and Romeyn, R. (2009) Antarctic-wide geomorphology as an aid to habitat mapping and locating vulnerable marine ecosystems. CCAMLR VME Workshop 2009. Document WS-VME-09/10
Layer name: sediments Description: Sediment features Value range: 14 categories Units: categorical Source: Griffiths 2014 (unpublished) URL: http://share.biodiversity.aq/GIS/antarctic/
Layer name: slope Description: Seafloor slope derived from bathymetry with the terrain function of the raster R package. Computation according to Horn (1981), i.e. option neighbors=8. The computation was done on the GEBCO bathymetry layer (0.0083 degrees resolution) and the resolution was then changed to 0.1 degrees. Unit set at degrees. Value range: 0.000252378 - 16.94809 Units: degrees Source: This study. Derived from GEBCO URL: https://www.gebco.net/data_and_products/gridded_bathymetry_data/ Citation: Horn, B.K.P., 1981. Hill shading and the reflectance map. Proceedings of the IEEE 69:14-47
Layer name: roughness Description: Seafloor roughness derived from bathymetry with the terrain function of raster R package. Roughness is the difference between the maximum and the minimum value of a cell and its 8 surrounding cells. The computation was done on the GEBCO bathymetry layer (0.0083 degrees resolution) and the resolution was then changed to 0.1 degrees. Value range: 0 - 5171.278 Units: unitless Source: This study. Derived from GEBCO URL: https://www.gebco.net/data_and_products/gridded_bathymetry_data/
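The slope and roughness layers above were derived with the terrain function of the R raster package; purely to illustrate the roughness definition (the difference between the maximum and minimum of a cell and its 8 surrounding cells), a rough Python equivalent on a toy grid could look like this (array values are made up, and edge handling differs from the R implementation):

import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

# Toy bathymetry grid in metres; in practice this would be the GEBCO raster.
bathy = np.array([[ -10.0,  -20.0,  -35.0],
                  [ -15.0, -120.0, -300.0],
                  [ -18.0, -250.0, -900.0]])

# Roughness: max minus min over each cell and its 8 surrounding cells (3x3 window).
roughness = maximum_filter(bathy, size=3) - minimum_filter(bathy, size=3)
print(roughness)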
Layer name: mixed layer depth Description: Summer mixed layer depth climatology from ARGOS data. Regridded from 2-degree grid using nearest neighbour interpolation Value range: 13.79615 - 461.5424 Units: m Source: https://data.aad.gov.au/metadata/records/Polar_Environmental_Data
Layer name: seasurface_current_speed Description: Current speed near the surface (2.5m depth), derived from the CAISOM model (Galton-Fenzi et al. 2012, based on ROMS model) Value range: 1.50E-04 - 1.7 Units: m/s Source: This study. Derived from Australian Antarctic Data Centre URL: https://data.aad.gov.au/metadata/records/Polar_Environmental_Data Citation: see Galton-Fenzi BK, Hunter JR, Coleman R, Marsland SJ, Warner RC (2012) Modeling the basal melting and marine ice accretion of the Amery Ice Shelf. Journal of Geophysical Research: Oceans, 117, C09031. http://dx.doi.org/10.1029/2012jc008214, https://data.aad.gov.au/metadata/records/polar_environmental_data
Layer name: seafloor_current_speed Description: Current speed near the sea floor, derived from the CAISOM model (Galton-Fenzi et al. 2012, based on ROMS) Value range: 3.40E-04 - 0.53 Units: m/s Source: This study. Derived from Australian Antarctic Data Centre URL: https://data.aad.gov.au/metadata/records/Polar_Environmental_Data Citation: see Galton-Fenzi BK, Hunter JR, Coleman R, Marsland SJ, Warner RC (2012) Modeling the basal melting and marine ice accretion of the Amery Ice Shelf. Journal of Geophysical Research: Oceans, 117, C09031. http://dx.doi.org/10.1029/2012jc008214, https://data.aad.gov.au/metadata/records/polar_environmental_data
Layer name: distance_antarctica Description: Distance to the nearest part of the Antarctic continent Value range: 0 - 3445 Units: km Source: https://data.aad.gov.au/metadata/records/Polar_Environmental_Data
Layer name: distance_canyon Description: Distance to the axis of the nearest canyon Value range: 0 - 3117 Units: km Source: https://data.aad.gov.au/metadata/records/Polar_Environmental_Data
Layer name: distance_max_ice_edge Description: Distance to the mean maximum winter sea ice extent (derived from daily estimates of sea ice concentration) Value range: -2614.008 - 2314.433 Units: km Source: https://data.aad.gov.au/metadata/records/Polar_Environmental_Data
Layer name: distance_shelf Description: Distance to nearest area of seafloor of depth 500m or shallower Value range: -1296 - 1750 Units: km Source: https://data.aad.gov.au/metadata/records/Polar_Environmental_Data
Layer name: ice_cover_max Description: Ice concentration fraction, maximum over the 1957-2017 time period Value range: 0 - 1 Units: unitless Source: BioOracle accessed 24/04/2018, see Assis et al. (2018) URL: http://www.bio-oracle.org/ Citation: Assis J, Tyberghein L, Bosch S, Verbruggen H, Serrao EA and De Clerck O (2018). Bio-ORACLE v2.0: Extending marine data layers for bioclimatic modelling. Global Ecology and Biogeography, 27(3), 277-284, see also https://www.ecmwf.int/en/research/climate-reanalysis/ocean-reanalysis
Layer name: ice_cover_mean Description: Ice concentration fraction, mean over the 1957-2017 time period Value range: 0 - 0.9708595 Units: unitless Source: BioOracle accessed 24/04/2018, see Assis et al. (2018) URL: http://www.bio-oracle.org/ Citation: Assis J, Tyberghein L, Bosch S, Verbruggen H, Serrao EA and De Clerck O (2018). Bio-ORACLE v2.0: Extending marine data layers for bioclimatic modelling. Global Ecology and Biogeography, 27(3), 277-284, see also https://www.ecmwf.int/en/research/climate-reanalysis/ocean-reanalysis
Layer name: ice_cover_min Description: Ice concentration fraction, minimum over the 1957-2017 time period Value range: 0 - 0.8536261 Units: unitless Source: BioOracle accessed 24/04/2018, see Assis et al. (2018) URL: http://www.bio-oracle.org/ Citation: Assis J, Tyberghein L, Bosch S, Verbruggen H, Serrao EA and De Clerck O (2018). Bio-ORACLE v2.0: Extending marine data layers for bioclimatic modelling. Global Ecology and Biogeography, 27(3), 277-284, see also https://www.ecmwf.int/en/research/climate-reanalysis/ocean-reanalysis
Layer name: ice_cover_range Description: Ice concentration fraction, difference between maximum and minimum over the 1957-2017 time period Value range: 0 - 1 Units: unitless Source: BioOracle accessed 24/04/2018, see Assis et al. (2018) URL: http://www.bio-oracle.org/ Citation: Assis J, Tyberghein L, Bosch S, Verbruggen H, Serrao EA and De Clerck O (2018). Bio-ORACLE v2.0: Extending marine data layers for bioclimatic modelling. Global Ecology and Biogeography, 27(3), 277-284, see also https://www.ecmwf.int/en/research/climate-reanalysis/ocean-reanalysis
Layer name: ice_thickness_max Description: Ice thickness, maximum over the 1957-2017 time period Value range: 0 - 3.471811 Units: m Source: BioOracle accessed 24/04/2018, see Assis et al. (2018) URL: http://www.bio-oracle.org/ Citation: Assis J, Tyberghein L, Bosch S, Verbruggen H, Serrao EA and De Clerck O (2018). Bio-ORACLE v2.0: Extending marine data layers for bioclimatic modelling. Global Ecology and Biogeography, 27(3), 277-284, see also https://www.ecmwf.int/en/research/climate-reanalysis/ocean-reanalysis
Layer name: ice_thickness_mean Description: Ice thickness, mean over the 1957-2017 time period Value range: 0 - 1.614133 Units: m Source: BioOracle accessed 24/04/2018, see Assis et al. (2018) URL: http://www.bio-oracle.org/ Citation: Assis J, Tyberghein L, Bosch S, Verbruggen H, Serrao EA and De Clerck O (2018). Bio-ORACLE v2.0: Extending marine data layers for bioclimatic modelling. Global Ecology and Biogeography, 27(3), 277-284, see also https://www.ecmwf.int/en/research/climate-reanalysis/ocean-reanalysis
Layer name: ice_thickness_min Description: Ice thickness, minimum over the 1957-2017 time period Value range: 0 - 0.7602701 Units: m Source: BioOracle accessed 24/04/2018, see Assis et al. (2018) URL: http://www.bio-oracle.org/ Citation: Assis J, Tyberghein L, Bosch S, Verbruggen H, Serrao EA and De Clerck O (2018). Bio-ORACLE v2.0: Extending marine data layers for bioclimatic modelling. Global Ecology and Biogeography, 27(3), 277-284, see also https://www.ecmwf.int/en/research/climate-reanalysis/ocean-reanalysis
Layer name: ice_thickness_range Description: Ice thickness, difference between maximum and minimum over the 1957-2017 time period Value range: 0 - 3.471811 Units: m Source: BioOracle accessed 24/04/2018, see Assis et al. (2018) URL: http://www.bio-oracle.org/ Citation: Assis J, Tyberghein L, Bosch S, Verbruggen H, Serrao EA and De Clerck O (2018). Bio-ORACLE v2.0: Extending marine data layers for bioclimatic modelling. Global Ecology and Biogeography, 27(3), 277-284, see also https://www.ecmwf.int/en/research/climate-reanalysis/ocean-reanalysis
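Several of the derived layers above describe simple raster operations: a 3x3-neighbourhood roughness, nearest-neighbour regridding of a coarse climatology, and range layers computed as maximum minus minimum. The R sketch below illustrates these steps with the raster package named in the roughness entry. It is a minimal sketch only: the input file names, the illustrative extent, and the choice between aggregate() and resample() for the resolution change are assumptions, not the exact processing chain used to produce the published layers.

```r
library(raster)

# Input file names are hypothetical placeholders.
bathy <- raster("gebco_bathymetry.tif")            # GEBCO grid, ~0.0083 degrees

# Roughness: max minus min of each cell and its 8 neighbours
rough <- terrain(bathy, opt = "roughness")

# Change resolution to 0.1 degrees; 0.1 / 0.0083 is roughly a factor of 12.
# Whether the published layer used aggregate() or resample() is an assumption.
rough_01 <- aggregate(rough, fact = 12, fun = mean)

# Nearest-neighbour regridding of a coarse (2-degree) climatology, as described
# for the mixed layer depth layer; the extent here is an illustrative
# Southern Ocean box, not the extent of the published grids.
template <- raster(xmn = -180, xmx = 180, ymn = -80, ymx = -30, res = 0.1)
mld      <- raster("mixed_layer_depth_2deg.tif")   # hypothetical input
mld_01   <- resample(mld, template, method = "ngb")

# Range layers (e.g. ice_cover_range, ice_thickness_range) are the cell-wise
# difference between the corresponding maximum and minimum layers.
ice_max   <- raster("ice_cover_max.tif")           # hypothetical input
ice_min   <- raster("ice_cover_min.tif")           # hypothetical input
ice_range <- ice_max - ice_min
```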
According to our latest research, the global Data Science Notebook as a Service market size reached USD 2.1 billion in 2024, reflecting robust adoption across industries driven by the need for scalable, collaborative analytics platforms. The market is exhibiting a strong compound annual growth rate (CAGR) of 27.6% and is anticipated to reach USD 15.6 billion by 2033, as per our projections. This impressive growth trajectory is primarily attributed to the rising demand for advanced analytics, machine learning, and seamless collaboration capabilities in data-driven organizations.
The rapid expansion of the Data Science Notebook as a Service market is underpinned by the increasing complexity of data environments and the need for integrated platforms that facilitate efficient data exploration, analysis, and visualization. Enterprises are transitioning away from traditional, siloed analytics tools in favor of cloud-based, collaborative notebook solutions that support real-time interaction and remote teamwork. The proliferation of big data, the democratization of data science, and the growing reliance on AI and machine learning models are further catalyzing market growth, as organizations seek tools that streamline the end-to-end analytics lifecycle. The flexibility and scalability offered by notebook as a service platforms are also critical factors driving adoption, particularly as businesses prioritize agility and rapid innovation in a competitive digital landscape.
Another major growth factor is the surge in remote and hybrid work models, which have fundamentally altered how teams interact with data and collaborate on analytics projects. Data Science Notebook as a Service platforms enable geographically dispersed teams to share code, insights, and visualizations in real time, fostering a culture of transparency and knowledge sharing. This capability is especially valuable in research-driven sectors such as healthcare, finance, and academia, where cross-functional collaboration is essential for innovation. Additionally, the integration of advanced security features and compliance tools has made these platforms more attractive to enterprises operating in regulated industries, further expanding the addressable market.
The evolution of AI and machine learning technologies is also fueling demand for Data Science Notebook as a Service solutions. As organizations increasingly embed predictive analytics and automation into their core operations, there is a growing need for platforms that support the full data science workflow – from data ingestion and preprocessing to model development, training, and deployment. Modern notebook services are integrating with a wide array of data sources, cloud infrastructures, and MLOps tools, enabling seamless scalability and operationalization of analytics. This integration is reducing the time-to-value for advanced analytics initiatives and empowering a broader range of users, including citizen data scientists and business analysts, to participate in data-driven decision-making.
From a regional perspective, North America currently dominates the Data Science Notebook as a Service market, accounting for the largest revenue share in 2024. The region’s leadership is driven by the high concentration of technology innovators, early adopters, and significant investments in digital transformation initiatives. However, Asia Pacific is emerging as the fastest-growing region, propelled by rapid digitalization, expanding enterprise IT infrastructure, and the rise of data-centric industries in countries like China, India, and Japan. Europe is also witnessing substantial growth, supported by strong regulatory frameworks, increased cloud adoption, and a focus on data-driven innovation across sectors. As global organizations continue to prioritize data science capabilities, the market is expected to see robust growth across all major regions.
The Component segment
Please cite the following paper when using this dataset: N. Thakur, “Twitter Big Data as a Resource for Exoskeleton Research: A Large-Scale Dataset of about 140,000 Tweets and 100 Research Questions,” Preprints, 2022, DOI: 10.20944/preprints202206.0383.v1
Abstract: Exoskeleton technology has been advancing rapidly in the recent past due to its multitude of applications and use cases in assisted living, the military, healthcare, firefighting, and industry. With the projected increase in the diverse uses of exoskeletons in these application domains and beyond over the next few years, it is crucial to study, interpret, and analyze user perspectives, public opinion, reviews, and feedback related to exoskeletons, for which a dataset is necessary. The Internet of Everything era, in which people spend more time on the Internet than ever before, holds the potential for developing such a dataset by mining relevant web behavior data from social media communications, which have increased exponentially in the last few years. Twitter, one such social media platform, is highly popular among all age groups, whose users share their views, opinions, perspectives, and feedback via tweets on diverse topics including, but not limited to, news, current events, politics, emerging technologies, family, relationships, and career opportunities. Therefore, this work presents a dataset of about 140,000 tweets related to exoskeletons that were mined over a period of 5 years, from May 21, 2017, to May 21, 2022. The tweets contain diverse forms of communication and conversation that convey user interests, user perspectives, public opinion, reviews, feedback, suggestions, etc., related to exoskeletons.
Instructions: The dataset contains only tweet identifiers (Tweet IDs), because Twitter's terms and conditions permit the redistribution of Twitter data for research purposes only in this form. The IDs therefore need to be hydrated before use. Hydration is the process of retrieving a tweet's complete information (such as the text of the tweet, username, user ID, date and time, etc.) using its ID. The Hydrator application (link to download the application: https://github.com/DocNow/hydrator/releases and link to a step-by-step tutorial: https://towardsdatascience.com/learn-how-to-easily-hydrate-tweets-a0f393ed340e#:~:text=Hydrating%20Tweets) or any similar application may be used for hydrating this dataset.
Data Description: This dataset consists of 7 .txt files. The following list gives the number of Tweet IDs and the date range (of the associated tweets) in each of these files.
- Filename: Exoskeleton_TweetIDs_Set1.txt (Number of Tweet IDs: 22945; Date Range of Tweets: July 20, 2021 - May 21, 2022)
- Filename: Exoskeleton_TweetIDs_Set2.txt (Number of Tweet IDs: 19416; Date Range of Tweets: Dec 1, 2020 - July 19, 2021)
- Filename: Exoskeleton_TweetIDs_Set3.txt (Number of Tweet IDs: 16673; Date Range of Tweets: April 29, 2020 - Nov 30, 2020)
- Filename: Exoskeleton_TweetIDs_Set4.txt (Number of Tweet IDs: 16208; Date Range of Tweets: Oct 5, 2019 - Apr 28, 2020)
- Filename: Exoskeleton_TweetIDs_Set5.txt (Number of Tweet IDs: 17983; Date Range of Tweets: Feb 13, 2019 - Oct 4, 2019)
- Filename: Exoskeleton_TweetIDs_Set6.txt (Number of Tweet IDs: 34009; Date Range of Tweets: Nov 9, 2017 - Feb 12, 2019)
- Filename: Exoskeleton_TweetIDs_Set7.txt (Number of Tweet IDs: 11351; Date Range of Tweets: May 21, 2017 - Nov 8, 2017)
Here, the last date for May is May 21, as it was the most recent date at the time of data collection. The dataset will be updated soon to incorporate more recent tweets.
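As a small illustration of working with these files, the R sketch below reads the seven Tweet-ID files from the current directory and writes a single combined ID list that can be fed to a hydration tool such as Hydrator. This is a sketch under assumptions: the combined-output file name is an arbitrary choice, and the expected total of 138585 IDs is simply the sum of the per-file counts listed above.

```r
# Minimal sketch: collect the Tweet IDs from the seven files listed above.
id_files  <- sprintf("Exoskeleton_TweetIDs_Set%d.txt", 1:7)
tweet_ids <- unlist(lapply(id_files, readLines))

length(tweet_ids)          # expected: 138585 (sum of the per-file counts above)
anyDuplicated(tweet_ids)   # quick sanity check for duplicate IDs (0 means none)

# Write one combined file that a hydration tool (e.g. Hydrator) can ingest;
# the output file name is an arbitrary placeholder.
writeLines(tweet_ids, "Exoskeleton_TweetIDs_All.txt")
```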
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the field of Computer Science, conference and workshop papers serve as important contributions and carry substantial weight in research assessment processes compared to other disciplines. However, a considerable number of these papers are not assigned a Digital Object Identifier (DOI), so their citations are not reported in widely used citation datasets such as OpenCitations and Crossref, which limits citation analysis. While the Microsoft Academic Graph (MAG) previously addressed this issue by providing substantial coverage, its discontinuation has created a void in available data. BIP! NDR aims to alleviate this issue and enhance research assessment processes within the field of Computer Science. To accomplish this, it leverages a workflow that identifies and retrieves Open Science papers lacking DOIs from the DBLP Corpus and, by performing text analysis, extracts citation information directly from their full text. The current version of the dataset contains more than 2.1M citations made by approximately 147K open access Computer Science conference or workshop papers that, according to DBLP, do not have a DOI.
File Structure: The dataset is formatted as a JSON Lines (JSONL) file (one JSON object per line) to facilitate file splitting and streaming. Each JSON object has three main fields:
- "_id": a unique identifier.
- "citing_paper": the "dblp_id" of the citing paper.
- "cited_papers": an array containing one object per reference found in the text of the "citing_paper"; each object may contain the following fields:
  - "dblp_id": the "dblp_id" of the cited paper. Optional; required if a "doi" is not present.
  - "doi": the DOI of the cited paper. Optional; required if a "dblp_id" is not present.
  - "bibliographic_reference": the raw citation string as it appears in the citing paper.
Changes from previous version:
- Replaced the PDF Downloader module with PublicationsRetriever (https://github.com/LSmyrnaios/PublicationsRetriever) to cover the full range of available URLs.
- Fixed a bug that affected how DBLP IDs were allocated to the downloaded PDF files (this bug affected records in previous versions of the dataset).
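The following R sketch (using the jsonlite package) shows one way to stream a JSONL file with this structure record by record and tally how many cited papers are identified by a DOI versus a DBLP ID. The file name bip_ndr.jsonl is an assumed placeholder, not the official distribution name, and the counting logic is only an illustration of the field layout described above.

```r
library(jsonlite)

# Stream the JSON Lines file one record per line (file name is an assumption).
con <- file("bip_ndr.jsonl", open = "r")
n_citing <- 0L
n_doi    <- 0L
n_dblp   <- 0L

while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  rec <- fromJSON(line, simplifyVector = FALSE)
  n_citing <- n_citing + 1L
  for (ref in rec$cited_papers) {
    if (!is.null(ref$doi))     n_doi  <- n_doi  + 1L   # reference resolved to a DOI
    if (!is.null(ref$dblp_id)) n_dblp <- n_dblp + 1L   # reference resolved to a DBLP ID
  }
}
close(con)

cat("citing papers:", n_citing,
    "| references with a DOI:", n_doi,
    "| references with a DBLP ID:", n_dblp, "\n")
```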