https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.11588/DATA/FJAG0X
Background: Data sharing is commonly seen as beneficial for science, but is not yet common practice. Research funding agencies are known to play a key role in promoting data sharing, but German funders' data sharing policies appear to lag behind in international comparison. This study aims to answer the question of how German data sharing experts inside and outside funding agencies perceive and evaluate German funders' data sharing policies and overall efforts to promote data sharing.
Methods: This study is based on sixteen guideline-structured interviews with representatives of German funding agencies and German research data experts from other organisations, who shared their perceptions of German funders' efforts to promote data sharing. By applying the method of qualitative content analysis to our interview data, we categorise and describe noteworthy aspects of the German data sharing policy landscape and illustrate our findings with interview passages.
Research data: This dataset contains summaries of interviews with data sharing and funding policy experts from German funding agencies and what we call "stakeholder organisations" (e.g., universities, research data infrastructure providers, etc.). We asked the interviewees about their perspectives on German funders' data sharing policies, for example regarding the actual status quo, their expectations about the potential role that funders can play in promoting data sharing, and general developments in this area. Supplement_1_Interview_guideline_funders.pdf and Supplement_2_Interview_guideline_stakeholders.pdf provide supplemental information in the form of the (German) interview guidelines used in this study. Supplement_3_Transcription_and_coding_guideline.pdf lays out the rules we followed in our transcription and coding process. Supplement_4_Category_system.pdf describes the underlying category system of the qualitative content analysis we conducted.
In order to request access to this data please complete the data request form. (*University of Bristol staff should use this form instead.)
The ASK feasibility trial: a randomised controlled feasibility trial and process evaluation of a complex multicomponent intervention to improve AccesS to living-donor Kidney transplantation
This trial was a two-arm, parallel-group, pragmatic, individually-randomised, controlled feasibility trial comparing usual care with a multicomponent intervention to increase access to living-donor kidney transplantation. The trial was based at two UK hospitals: a transplanting hospital and a non-transplanting referral hospital. 62 participants were recruited. 60 participants consented to data sharing, and their trial data are available here; 2 participants did not consent to data sharing, and their data are not available.
This project contains:
1. The ASK feasibility trial dataset
2. The trial questionnaire
3. An example consent form
4. Trial information sheet
This dataset is part of a series:
ASK feasibility trial documents: https://doi.org/10.5523/bris.1u5ooi0iqmb5c26zwim8l7e8rm
The ASK feasibility trial: CONSORT documents: https://doi.org/10.5523/bris.2iq6jzfkl6e1x2j1qgfbd2kkbb
The ASK feasibility trial: Wellcome Open Research CONSORT checklist: https://doi.org/10.5523/bris.1m3uhbdfdrykh27iij5xck41le
The ASK feasibility trial: qualitative data: https://doi.org/10.5523/bris.1qm9yblprxuj2qh3o0a2yylgg
The incorporation of data sharing into the research lifecycle is an important part of modern scholarly debate. In this study, the DataONE Usability and Assessment working group addresses two primary goals: To examine the current state of data sharing and reuse perceptions and practices among research scientists as they compare to the 2009/2010 baseline study, and to examine differences in practices and perceptions across age groups, geographic regions, and subject disciplines. We distributed surveys to a multinational sample of scientific researchers at two different time periods (October 2009 to July 2010 and October 2013 to March 2014) to observe current states of data sharing and to see what, if any, changes have occurred in the past 3–4 years. We also looked at differences across age, geographic, and discipline-based groups as they currently exist in the 2013/2014 survey. Results point to increased acceptance of and willingness to engage in data sharing, as well as an increase in actual data sharing behaviors. However, there is also increased perceived risk associated with data sharing, and specific barriers to data sharing persist. There are also differences across age groups, with younger respondents feeling more favorably toward data sharing and reuse, yet making less of their data available than older respondents. Geographic differences exist as well, which can in part be understood in terms of collectivist and individualist cultural differences. An examination of subject disciplines shows that the constraints and enablers of data sharing and reuse manifest differently across disciplines. Implications of these findings include the continued need to build infrastructure that promotes data sharing while recognizing the needs of different research communities. Moving into the future, organizations such as DataONE will continue to assess, monitor, educate, and provide the infrastructure necessary to support such complex grand science challenges.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset supports the publication "Data Sharing and Use in Cybersecurity Research" by I. Kouper and S. Stone (in the CODATA Data Science Journal).
Paper abstract: Data sharing is crucial for strengthening research integrity and outcomes, and for addressing complex problems. In cybersecurity research, data sharing can enable the development of new security measures, prediction of malicious attacks, and increased privacy. Understanding the landscape of data sharing and use in cybersecurity research can help to improve both the existing practices of data management and use and the outcomes of cybersecurity research. To this end, this study used methods of qualitative analysis and descriptive statistics to analyze 171 papers published between 2015 and 2019, their authors' characteristics, such as gender and professional title, and datasets' attributes, including their origin and public availability. The study found that more than half of the datasets in the sample (58%) and an even larger percentage of code in the papers (89%) were not publicly available. By offering an updated in-depth perspective on data practices in cybersecurity, including the role of authors, research methods, data sharing, and code availability, this study calls for the improvement of data management in cybersecurity research and for further collaboration in addressing the issues of cyberinfrastructure, policies, and citation and attribution standards in order to advance the quality and availability of data in this field.
The dataset consists of four files:
codebook.xlsx - the codebook of the study that contains coding categories, coding variables and their descriptions, and specific codes developed to describe the data.
publications.xlsx - metadata and coded content for 171 publications collected for the study.
first authors.xlsx - metadata and coded content for first authors of the analyzed publications. Emails and web_links (links to CV or individual homepage) were removed for privacy reasons.
datasets.xlsx - metadata and coded content for 387 unique datasets identified in the examined publications.
Suggested citation for the dataset: Kouper, I. & Stone, S. (2023). Cybersecurity research publications, authors, and datasets 2015-2019. Figshare. DOI: 10.6084/m9.figshare.24639387
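For readers who want to recompute headline figures such as the 58% of datasets that were not publicly available, a minimal pandas sketch might look like the following; the column name and coding values are assumptions for illustration, so check codebook.xlsx for the actual variables:

```python
import pandas as pd

# Load the coded dataset inventory. The column name "public_availability" and
# its values are hypothetical -- consult codebook.xlsx for the real variables.
datasets = pd.read_excel("datasets.xlsx")

share_unavailable = (datasets["public_availability"] == "no").mean()
print(f"Datasets not publicly available: {share_unavailable:.0%}")
```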
This dataset collects the slides that were presented at the Data Collaborations Across Boundaries session in SciDataCon 2022, part of the International Data Week.
The following session proposal was prepared by Tyng-Ruey Chuang and submitted to SciDataCon 2022 organizers for consideration on 2022-02-28. The proposal was accepted on 2022-03-28. Six abstracts were submitted and accepted to this session. Five presentations were delivered online in a virtual session on 2022-06-21.
Data Collaborations Across Boundaries
There are many good stories about data collaborations across boundaries. We need more. We also need to share the lessons each of us has learned from collaborating with parties and communities not in our familiar circles.
By boundaries, we mean not just the regulatory borders between nation states that affect data sharing, but the various barriers, readily conceivable or not, that hinder collaboration in aggregating, sharing, and reusing data for social good. These barriers to collaboration exist between academic disciplines, between economic players, and between the many user communities, just to name a few. There are also cross-domain barriers, for example those that lie among data practitioners, public administrators, and policy makers when they are articulating the why, what, and how of "open data" and debating its economic significance and fair distribution. This session aims to bring together experiences and thoughts on good data practices in facilitating collaborations across boundaries and domains.
The success of Wikipedia proves that collaborative content production and service, by way of copyleft licenses, can be sustainable when coordinated by a non-profit and funded by the general public. Collaborative code repositories like GitHub and GitLab demonstrate the enormous value and mass scale of systems-facilitated integration of user contributions that run across multiple programming languages and developer communities. Research data aggregators and repositories such as GBIF, GISAID, and Zenodo have served numerous researchers across academic disciplines. Citizen science projects and platforms, for instance eBird, Galaxy Zoo, and Taiwan Roadkill Observation Network (TaiRON), not only collect data from diverse communities but also manage and release datasets for research use and public benefit (e.g. TaiRON datasets being used to improve road design and reduce animal mortality). At the same time large scale data collaborations depend on standards, protocols, and tools for building registries (e.g. Archival Resource Key), ontologies (e.g. Wikidata and schema.org), repositories (e.g. CKAN and Omeka), and computing services (e.g. Jupyter Notebook). There are many types of data collaborations. The above lists only a few.
This session proposal calls for contributions to bring forward lessons learned from collaborative data projects and platforms, especially about those that involve multiple communities and/or across organizational boundaries. Presentations focusing on the following (non-exclusive) topics are sought after:
- Support mechanisms and governance structures for data collaborations across organizations/communities.
- Data policies --- such as data sharing agreements, memoranda of understanding, terms of use, privacy policies, etc. --- for facilitating collaborations across organizations/communities.
- Traditional and non-traditional funding sources for data collaborations across multiple parties; sustainability of data collaboration projects, platforms, and communities.
- Data workflows --- collection, processing, aggregation, archiving, and publishing, etc. --- designed with considerations of (external) collaboration.
- Collaborative web platforms for data acquisition, curation, analysis, visualization, and education.
- Examples and insights from data trusts, data coops, as well as other formal and informal forms of data stewardship.
- Debates on the pros and cons of centralized, distributed, and/or federated data services.
- Practical lessons learned from data collaboration stories: failure, success, incidence, unexpected turns of events, aftermath, etc. (no story is too small!).
https://qdr.syr.edu/policies/qdr-standard-access-conditions
Project Summary: As part of a qualitative study of abortion reporting in the United States, the research team conducted cognitive interviews to iteratively assess new question wording and introductions designed to improve the accuracy of abortion reporting in surveys (to be shared on the Qualitative Data Repository in a separate submission). As expectations to share the data that underlie research increase, understanding how participants, particularly those taking part in qualitative research, respond to requests for data sharing is necessary. We assessed research participants' willingness to engage in, understanding of, and motivations for data sharing.
Data Overview: The data consist of excerpts from cognitive interviews with 64 cisgender women in two states in January and February of 2020, in which researchers asked respondents for consent to share de-identified data. Eligibility criteria included: assigned female at birth, currently identifying as a woman between the ages of 18 and 49, English-speaking, and reporting ever having had penile-vaginal sex. Respondents were also screened for abortion history to ensure that at least half the sample reported a prior abortion. At the end of interviews, participants were asked to reflect on their motivations for agreeing or declining to share their data. The data included here are coded excerpts of their answers. Most respondents consented to data sharing, citing helping others as a primary motivation for agreeing to share their data. However, a substantial number of participants demonstrated limited understanding of "data sharing." Data available here include the following materials: overview of methods, cognitive interview consent form (with language for data sharing consent), and data sharing analysis coding scheme.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used as a data corpus for a bibliometric analysis titled "Unveiling Openness in Energy Research: A Bibliometric Analysis Focusing on Open Access and Data Sharing Practices". The CSV file (2024-12-06_OpenAlex_API_download_works_Energy_Germany_(2013-2023)) was collected on December 6th, 2024, using the OpenAlex API and the search criteria: OpenAlex field "Energy", continent "Europe", country "Germany", and publication years 2013-2023. Based on this file, two sample files were extracted - one by subfield (2024-12-06_OpenAlex_API_dwonload_works_Energy_Germany_(2013-2023)_sampled_by_subfield) and another by year group (2024-12-06_OpenAlex_API_download_works_Energy_Germany_(2013-2023)_sampled_by_year_group). This dataset was collected and used to answer the following research questions:
- What percentage of energy research publications are OA? How do the types (gold, green, etc.) of these publications differ?
- Are there notable differences in OA and data sharing practices in different subfields of energy research?
- How commonly are datasets for energy studies shared? What are the primary repositories used?
- What kind of data sharing or publication practices are widespread? How has this evolved over the last decade?
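A hedged sketch of this kind of OpenAlex API collection is shown below. The cursor-pagination pattern follows the public API documentation, but the exact identifier of the "Energy" field filter is an assumption and should be verified against the OpenAlex topics hierarchy before running:

```python
import requests

# Minimal cursor-paginated download from the OpenAlex API. The field filter
# value is a placeholder -- look up the actual ID of the "Energy" field in
# the OpenAlex topics hierarchy first.
BASE = "https://api.openalex.org/works"
params = {
    "filter": ("authorships.institutions.country_code:de,"
               "from_publication_date:2013-01-01,"
               "to_publication_date:2023-12-31,"
               "primary_topic.field.id:fields/21"),  # assumed "Energy" field ID
    "per-page": 200,
    "cursor": "*",
}

works = []
while True:
    page = requests.get(BASE, params=params, timeout=60).json()
    works.extend(page["results"])
    next_cursor = page["meta"].get("next_cursor")
    if not next_cursor:
        break
    params["cursor"] = next_cursor

print(len(works), "works retrieved")
```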
This dataset contains anonymized transcripts of interviews performed during a qualitative interview study with members of 16 research funding agencies.
Sample: Funding agencies were located mainly, but not exclusively, in Europe. Funding agencies were selected using a purposive sampling strategy that aimed to acquire sufficient representation of (a) national and international agencies; (b) public and philanthropic agencies; and (c) agencies in continental Europe and the anglophone world. Suitable funding agencies were identified through various means, such as a list of health research funding organizations ranked by their annual expenditure on health research (https://www.healthresearchfunders.org/health-research-funding-organizations/) and Science Europe Working Groups.
Background: Open science policy documents have emphasized the need to install more incentives for data sharing. These incentives are often understood as being reputational or financial. Additionally, there are other policy measures that could be taken, such as data sharing mandates.
Aim: To document views of funding agencies on (1) potential alterations to recognition systems in academia; (2) incentives to enhance the sharing of cohort data; (3) data sharing policies in terms of the governance of cohort data; and (4) other potential interactions between science policy and data sharing platforms for cohorts. Our study focused on the sharing of patient- and population-based cohorts through data infrastructures.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Objective(s): Momentum for open access to research is growing. Funding agencies and publishers increasingly require that researchers make their data and research outputs open and publicly available. However, clinical researchers struggle to find real-world examples of Open Data sharing. The aim of this one-hour virtual workshop is to provide real-world examples of Open Data sharing for both qualitative and quantitative data. Specifically, participants will learn: 1. Primary challenges and successes when sharing quantitative and qualitative clinical research data. 2. Platforms available for open data sharing. 3. Ways to troubleshoot data sharing and publish from open data.
Workshop Agenda:
1. "Data sharing during the COVID-19 pandemic" - Speaker: Srinivas Murthy, Clinical Associate Professor, Department of Pediatrics, Faculty of Medicine, University of British Columbia; Investigator, BC Children's Hospital.
2. "Our experience with Open Data for the 'Integrating a neonatal healthcare package for Malawi' project" - Speaker: Maggie Woo Kinshella, Global Health Research Coordinator, Department of Obstetrics and Gynaecology, BC Children's and Women's Hospital and University of British Columbia.
This workshop draws on work supported by the Digital Research Alliance of Canada.
Data Description: Presentation slides, workshop video, and workshop communication: Srinivas Murthy's "Data sharing during the COVID-19 pandemic" presentation and accompanying PowerPoint slides, and Maggie Woo Kinshella's "Our experience with Open Data for the 'Integrating a neonatal healthcare package for Malawi' project" presentation and accompanying PowerPoint slides. This workshop was developed as part of Dr. Ansermino's Data Champions Pilot Project supported by the Digital Research Alliance of Canada.
NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days. Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator on this page under "collaborate with the pediatric sepsis colab."
Summary: Over the past decade, many scholarly journals have adopted policies on data sharing, with an increasing number of journals requiring that authors share the data underlying their published work. Frequently, qualitative data are excluded from those policies, explicitly or implicitly. A few journals, however, intentionally do not make such a distinction. This project focuses on articles published in eight of the open-access journals maintained by the Public Library of Science (PLOS). All PLOS journals introduced strict data sharing guidelines in 2014, applying to all empirical data on the basis of which articles are published. We collected a database of more than 2,300 articles containing a qualitative data component published between January 1, 2015 and August 23, 2023 and analyzed the data availability statements (DAS) researchers made regarding the availability, or lack thereof, of their data. We describe the degree to which and manner in which data are reportedly available (for example, in repositories, via institutional gatekeepers, or on request from the author) versus declared unavailable. We also outline several dimensions of patterned variation in the data availability statements, including temporal patterns and variation by data type. Based on the results, we provide recommendations to researchers on how to make their data availability statements clearer, more transparent, and more informative, and to journal editors and reviewers on how to interpret and evaluate statements to ensure they accurately reflect a given data availability scenario. Finally, we suggest a workflow which can link interactions with repositories most productively as part of a typical editorial process.
Data Overview: This data deposit includes data and code to assemble the dataset, generate all figures and values used in the paper and appendix, and generate the codebook. It also includes the codebook and the figures. The analysis.R script and the data in data/analysis are sufficient to reproduce all findings in the paper. The additional scripts and the data files in data/raw are included for full transparency and to facilitate the detection of any errors in the data processing pipeline. Their structure is due to the development of the project over time.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset was collected from Flickr in 2008 by crawling about 400,000 pictures together with their label descriptions and user information. Image descriptions and user information are organized as CSV files; the original images are stored in JPG format. The dataset contains three files: Imagelist contains the picture list and records of pictures marked as users' favorites; final_feture_tag contains the pictures' labels. Note: The dataset is a sample version; after unzipping, the file size is about 61.8 MB.
http://rightsstatements.org/vocab/InC/1.0/
In the current scientific publishing landscape, there is a need for an authoring workflow that easily integrates data and code into manuscripts and that enables the data and code to be published in reusable form. Automated embedding of data and code into published output will enable superior communication and data archiving. In this work, we demonstrate a proof of concept for a workflow, org-mode, which successfully provides this authoring capability and workflow integration. We illustrate this concept in a series of examples for potential uses of this workflow. First, we use data on citation counts to compute the h-index of an author, and show two code examples for calculating the h-index. The source for each example is automatically embedded in the PDF during the export of the document. We demonstrate how data can be embedded in image files, which themselves are embedded in the document. Finally, metadata about the embedded files can be automatically included in the exported PDF, and accessed by computer programs. In our customized export, we embedded metadata about the attached files in the PDF in an Info field. A computer program could parse this output to get a list of embedded files and carry out analyses on them. Authoring tools such as Emacs + org-mode can greatly facilitate the integration of data and code into technical writing. These tools can also automate the embedding of data into document formats intended for consumption.
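As a rough illustration of the first example, here is a minimal Python sketch of the h-index computation described above (the paper's own examples are the ones embedded in its PDF; this is not that code):

```python
def h_index(citations):
    """Largest h such that the author has h papers with at least h citations."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank  # this paper still has enough citations to count
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # -> 4
```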
Background: Multiple Sclerosis Partners Advancing Technology and Health Solutions (MS PATHS) is the first example of a learning health system in multiple sclerosis (MS). This paper describes the initial implementation of MS PATHS and initial patient characteristics.
Methods: MS PATHS is an ongoing initiative conducted in 10 healthcare institutions in three countries, each contributing standardized information acquired during routine care. Institutional participation required the following: an active MS patient census of ≥500, at least one Siemens 3T magnetic resonance imaging scanner, and willingness to standardize patient assessments, share standardized data for research, and offer universal enrolment to capture a representative sample. Eligible participants have a diagnosis of MS, including clinically isolated syndrome, and consent to sharing pseudonymized data for research. MS PATHS incorporates a self-administered patient assessment tool, the Multiple Sclerosis Performance Test, to collect a structured history, patient-reported outcomes, and quantitative testing of cognition, vision, dexterity, and walking speed. Brain magnetic resonance imaging is acquired using standardized acquisition sequences on Siemens 3T scanners. Quantitative measures of brain volume and lesion load are obtained. Using a separate consent, the patients contribute DNA, RNA, and serum for future research. The clinicians retain complete autonomy in using MS PATHS data in patient care. A shared governance model ensures transparent data and sample access for research.
Results: As of August 5, 2019, MS PATHS enrolment included participants (n = 16,568) with broad ranges of disease subtypes, duration, and severity. Overall, 14,643 (88.4%) participants contributed data at one or more time points. The average patient contributed 15.6 person-months of follow-up (95% CI: 15.5–15.8); overall, 166,158 person-months of follow-up have been accumulated. Those with relapsing–remitting MS demonstrated more demographic heterogeneity than the participants in six randomized phase 3 MS treatment trials. Across sites, significant variation was observed in follow-up frequency and patterns of disease-modifying therapy use.
Conclusions: Through digital health technology, it is feasible to collect standardized, quantitative, and interpretable data from each patient in busy MS practices, facilitating the merger of research and patient care. This approach holds promise for data-driven clinical decisions and accelerated systematic learning.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Collaboratory is a software product developed and maintained by HandsOn Connect Cloud Solutions. It is intended to help higher education institutions accurately and comprehensively track their relationships with the community through engagement and service activities. Institutions that use Collaboratory are given the option to opt in to a data sharing initiative at the time of onboarding, which grants us permission to de-identify their data and make it publicly available for research purposes. HandsOn Connect is committed to making Collaboratory data accessible to scholars for research, toward the goal of advancing the field of community engagement and social impact.
Collaboratory is not a survey, but a dynamic software tool designed to facilitate comprehensive, longitudinal data collection on community engagement and public service activities conducted by faculty, staff, and students in higher education. We provide a standard questionnaire that was developed by Collaboratory's co-founders (Janke, Medlin, and Holland) in the Institute for Community and Economic Engagement at UNC Greensboro, which continues to be closely monitored and adapted by staff at HandsOn Connect and academic colleagues. It includes descriptive characteristics (what, where, when, with whom, to what end) of activities and invites participants to periodically update their information in accordance with activity progress over time. Examples of individual questions include the focus areas addressed, populations served, on- and off-campus collaborators, connections to teaching and research, and location information, among others.
The Collaboratory dataset contains data from 45 institutions beginning in March 2016 and continues to grow as more institutions adopt Collaboratory and expand its use. The data represent over 6,200 published activities (and additional associated content) across our user base.
Please cite this data as: Medlin, Kristin and Singh, Manmeet. Dataset on Higher Education Community Engagement and Public Service Activities, 2016-2023. Collaboratory [producer], 2021. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2023-07-07. https://doi.org/10.3886/E136322V1
When you cite this data, please also include: Janke, E., Medlin, K., & Holland, B. (2021, November 9). To What End? Ten Years of Collaboratory. https://doi.org/10.31219/osf.io/a27nb
https://creativecommons.org/publicdomain/zero/1.0/
Description:
The "Daily Social Media Active Users" dataset provides a comprehensive and dynamic look into the digital presence and activity of global users across major social media platforms. The data was generated to simulate real-world usage patterns for 13 popular platforms, including Facebook, YouTube, WhatsApp, Instagram, WeChat, TikTok, Telegram, Snapchat, X (formerly Twitter), Pinterest, Reddit, Threads, LinkedIn, and Quora. This dataset contains 10,000 rows and includes several key fields that offer insights into user demographics, engagement, and usage habits.
Dataset Breakdown:
Platform: The name of the social media platform where the user activity is tracked. It includes globally recognized platforms, such as Facebook, YouTube, and TikTok, that are known for their large, active user bases.
Owner: The company or entity that owns and operates the platform. Examples include Meta for Facebook, Instagram, and WhatsApp, Google for YouTube, and ByteDance for TikTok.
Primary Usage: This category identifies the primary function of each platform. Social media platforms differ in their primary usage, whether it's for social networking, messaging, multimedia sharing, professional networking, or more.
Country: The geographical region where the user is located. The dataset simulates global coverage, showcasing users from diverse locations and regions. It helps in understanding how user behavior varies across different countries.
Daily Time Spent (min): This field tracks how much time a user spends on a given platform on a daily basis, expressed in minutes. Time spent data is critical for understanding user engagement levels and the popularity of specific platforms.
Verified Account: Indicates whether the user has a verified account. This feature mimics real-world patterns where verified users (often public figures, businesses, or influencers) have enhanced status on social media platforms.
Date Joined: The date when the user registered or started using the platform. This data simulates user account history and can provide insights into user retention trends or platform growth over time.
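To sketch how this table might be explored once loaded, here is a minimal pandas example; the file name and exact column labels are assumptions based on the field breakdown above, so adjust them to match the actual CSV:

```python
import pandas as pd

# File name and column labels are assumed from the field breakdown above.
df = pd.read_csv("daily_social_media_active_users.csv")

# If "Verified Account" is stored as Yes/No text, convert it to boolean first.
if df["Verified Account"].dtype == object:
    df["Verified Account"] = df["Verified Account"].eq("Yes")

# Average daily minutes per platform, most-used first.
usage = (df.groupby("Platform")["Daily Time Spent (min)"]
           .mean()
           .sort_values(ascending=False))
print(usage.head())

# Share of verified accounts by country.
print(df.groupby("Country")["Verified Account"].mean()
        .sort_values(ascending=False).head())
```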
Context and Use Cases:
Researchers, data scientists, and developers can use this dataset to:
Model User Behavior: By analyzing patterns in daily time spent, verified status, and country of origin, users can model and predict social media engagement behavior.
Test Analytics Tools: Social media monitoring and analytics platforms can use this dataset to simulate user activity and optimize their tools for engagement tracking, reporting, and visualization.
Train Machine Learning Algorithms: The dataset can be used to train models for various tasks like user segmentation, recommendation systems, or churn prediction based on engagement metrics.
Create Dashboards: This dataset can serve as the foundation for creating user-friendly dashboards that visualize user trends, platform comparisons, and engagement patterns across the globe.
Conduct Market Research: Business intelligence teams can use the data to understand how various demographics use social media, offering valuable insights into the most engaged regions, platform preferences, and usage behaviors.
Sources of Inspiration: This dataset is inspired by public data from industry reports, such as those from Statista, DataReportal, and other market research platforms. These sources provide insights into the global user base and usage statistics of popular social media platforms. The synthetic nature of this dataset allows for the use of realistic engagement metrics without violating any privacy concerns, making it an ideal tool for educational, analytical, and research purposes.
The structure and design of the dataset are based on real-world usage patterns and aim to represent a variety of users from different backgrounds, countries, and activity levels. This diversity makes it an ideal candidate for testing data-driven solutions and exploring social media trends.
Future Considerations:
As the social media landscape continues to evolve, this dataset can be updated or extended to include new platforms, engagement metrics, or user behaviors. Future iterations may incorporate features like post frequency, follower counts, engagement rates (likes, comments, shares), or even sentiment analysis from user-generated content.
By leveraging this dataset, analysts and data scientists can create better, more effective strategies ...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We designed and organized a one-day workshop, where in the context of FAIR the following themes were discussed and practiced: scientific transparency and reproducibility; how to write a README; data and code licenses; spatial data; programming code; examples of published datasets; data reuse; and discipline and motivation. The intended audience were researchers at the Environmental Science Group of Wageningen University and Research. All workshop materials were designed with further development and reuse in mind and are shared through this dataset.
Derived from over 150 years of lexical research, these comprehensive textual and audio data, focused on American English, provide linguistically annotated material ideal for NLP applications, LLM training and/or fine-tuning, as well as educational and game apps.
One of our flagship datasets, the American English data is expertly curated and linguistically annotated by professionals, with annual updates to ensure accuracy and relevance. The following datasets in American English are available for license:
Key Features (approximate numbers):
Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure relevance and accuracy.
The American English Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary American English. It includes rich linguistic details such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.
This dataset provides IPA transcriptions and clean audio data in contemporary American English. It includes syllabified transcriptions, variant spellings, POS tags, and pronunciation group identifiers. The audio files are supplied separately and linked where available for seamless integration - perfect for teams building TTS systems, ASR models, and pronunciation engines.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, machine translation, AI training and fine-tuning, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals. Please note that some datasets may have rights restrictions. Contact us for more information.
About the sample:
To help you explore the structure and features of our dataset on this platform, we provide a sample in CSV and/or JSON formats for one of the presented datasets, for preview purposes only, as shown on this page. This sample offers a quick and accessible overview of the data's contents and organization.
Our full datasets are available in various formats, depending on the language and type of data you require. These may include XML, JSON, TXT, XLSX, CSV, WAV, MP3, and other file types. Please contact us (Growth.OL@oup.com) if you would like to receive the original sample with full details.
https://spdx.org/licenses/CC0-1.0.html
The COVID-19 pandemic has brought substantial attention to the systems used to communicate biomedical research. In particular, the need to rapidly and credibly communicate research findings has led many stakeholders to encourage researchers to adopt open science practices such as posting preprints and sharing data. To examine the degree to which this has led to the actual adoption of such practices, we examined the "openness" of a sample of 539 published papers describing the results of randomized controlled trials testing interventions to prevent or treat COVID-19. The majority (56%) of the papers in this sample were free to read at the time of our investigation and 23.56% were preceded by preprints. However, there is no guarantee that the papers without an open license will be available without a subscription in the future, and only 49.61% of the preprints we identified were linked to the subsequent peer-reviewed version. Of the 331 papers in our sample with statements identifying if (and how) related datasets were available, only a small minority indicated that data were available in a repository that facilitates rapid verification and reuse. Our results demonstrate that, while progress has been made, there is still a significant mismatch between aspiration and actual practice in the adoption of open science in an important area of the COVID-19 literature.
Methods: Data were pulled from publicly available sources (e.g., Unpaywall) and from manual inspection of the papers themselves. Instructions for members of the research team have been uploaded to Protocols.io: https://doi.org/10.17504/protocols.io.x54v9jx7zg3e
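Since the methods mention Unpaywall, a minimal sketch of the kind of per-DOI lookup involved might look like the following; the endpoint is Unpaywall's public v2 REST API, and the DOI and email shown are placeholders:

```python
import requests

def openness(doi: str, email: str) -> dict:
    """Query Unpaywall for the open-access status of one DOI."""
    url = f"https://api.unpaywall.org/v2/{doi}"
    record = requests.get(url, params={"email": email}, timeout=30).json()
    return {
        "is_oa": record.get("is_oa"),
        "oa_status": record.get("oa_status"),  # e.g. gold, green, bronze, closed
        "license": (record.get("best_oa_location") or {}).get("license"),
    }

# Placeholder DOI and contact address.
print(openness("10.1038/s41586-020-2012-7", "you@example.org"))
```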
According to our latest research, the AI-Generated Synthetic Tabular Dataset market size reached USD 1.42 billion in 2024 globally, reflecting the rapid adoption of artificial intelligence-driven data generation solutions across numerous industries. The market is expected to expand at a robust CAGR of 34.7% from 2025 to 2033, reaching a forecasted value of USD 19.17 billion by 2033. This exceptional growth is primarily driven by the increasing need for high-quality, privacy-preserving datasets for analytics, model training, and regulatory compliance, particularly in sectors with stringent data privacy requirements.
One of the principal growth factors propelling the AI-Generated Synthetic Tabular Dataset market is the escalating demand for data-driven innovation amidst tightening data privacy regulations. Organizations across healthcare, finance, and government sectors are facing mounting challenges in accessing and sharing real-world data due to GDPR, HIPAA, and other global privacy laws. Synthetic data, generated by advanced AI algorithms, offers a solution by mimicking the statistical properties of real datasets without exposing sensitive information. This enables organizations to accelerate AI and machine learning development, conduct robust analytics, and facilitate collaborative research without risking data breaches or non-compliance. The growing sophistication of generative models, such as GANs and VAEs, has further increased confidence in the utility and realism of synthetic tabular data, fueling adoption across both large enterprises and research institutions.
Another significant driver is the surge in digital transformation initiatives and the proliferation of AI and machine learning applications across industries. As businesses strive to leverage predictive analytics, automation, and intelligent decision-making, the need for large, diverse, and high-quality datasets has become paramount. However, real-world data is often siloed, incomplete, or inaccessible due to privacy concerns. AI-generated synthetic tabular datasets bridge this gap by providing scalable, customizable, and bias-mitigated data for model training and validation. This not only accelerates AI deployment but also enhances model robustness and generalizability. The flexibility of synthetic data generation platforms, which can simulate rare events and edge cases, is particularly valuable in sectors like finance and healthcare, where such scenarios are underrepresented in real datasets but critical for risk assessment and decision support.
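To make the core idea concrete, here is a deliberately naive sketch of synthesizing numeric tabular data by preserving column means and covariances; the production systems this report describes use far richer generative models such as GANs and VAEs, so treat this only as an illustration of "mimicking statistical properties":

```python
import numpy as np

def synthesize_numeric(real: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Draw synthetic rows matching the real table's means and covariance.

    A simple stand-in for the GAN/VAE approaches discussed above: it
    preserves first- and second-order statistics but nothing else.
    """
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Toy example: two numeric columns (e.g., age and income).
rng = np.random.default_rng(1)
real = np.column_stack([rng.normal(40, 10, 500),
                        rng.normal(50_000, 8_000, 500)])
synthetic = synthesize_numeric(real, n_samples=1_000)
print(synthetic.mean(axis=0))  # close to the real column means
```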
The rapid evolution of the AI-Generated Synthetic Tabular Dataset market is also underpinned by technological advancements and growing investments in AI infrastructure. The availability of cloud-based synthetic data generation platforms, coupled with advancements in natural language processing and tabular data modeling, has democratized access to synthetic datasets for organizations of all sizes. Strategic partnerships between technology providers, research institutions, and regulatory bodies are fostering innovation and establishing best practices for synthetic data quality, utility, and governance. Furthermore, the integration of synthetic data solutions with existing data management and analytics ecosystems is streamlining workflows and reducing barriers to adoption, thereby accelerating market growth.
Regionally, North America dominates the AI-Generated Synthetic Tabular Dataset market, accounting for the largest share in 2024 due to the presence of leading AI technology firms, strong regulatory frameworks, and early adoption across industries. Europe follows closely, driven by stringent data protection laws and a vibrant research ecosystem. The Asia Pacific region is emerging as a high-growth market, fueled by rapid digitalization, government initiatives, and increasing investments in AI research and development. Latin America and the Middle East & Africa are also witnessing growing interest, particularly in sectors like finance and government, though market maturity varies across countries. The regional landscape is expected to evolve dynamically as regulatory harmonization, cross-border data collaboration, and technological advancements continue to shape market trajectories globally.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The original dataset has been released in three versions of KuaiRand for different uses:
1. KuaiRand-27K (23GB logs + 23GB features): the complete KuaiRand dataset, with over 27K users and 32 million videos. Can be downloaded with: wget https://zenodo.org/records/10439422/files/KuaiRand-27K.tar.gz
2. KuaiRand-1K (829MB logs + 3.5GB features): randomly samples 1,000 users from KuaiRand-27K, then removes all irrelevant videos; about 4 million videos remain. Can be downloaded with: wget https://zenodo.org/records/10439422/files/KuaiRand-1K.tar.gz
3. KuaiRand-Pure (184MB logs + 10MB features): only keeps the logs for the 7,583 videos in the candidate pool (uploaded on this page).
There are three log files in each version, e.g. in KuaiRand-Pure:
- log_random_4_22_to_5_08.csv contains all interactions resulting from random intervention.
- log_standard_4_22_to_5_08.csv contains all interactions of standard recommendation.
- log_standard_4_08_to_4_21.csv contains all interactions of standard recommendation for the same users in the previous two weeks (2022.04.08 ~ 2022.04.21).
Complete files and features description in: https://kuairand.com/
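As a quick-start illustration, a minimal pandas sketch for comparing feedback rates between randomly exposed and standard impressions might look as follows; the column names are assumptions based on the dataset documentation, so verify them against https://kuairand.com/:

```python
import pandas as pd

# Column names (e.g. "is_click") are assumptions based on the KuaiRand
# documentation; check https://kuairand.com/ for the authoritative schema.
random_log = pd.read_csv("log_random_4_22_to_5_08.csv")
standard_log = pd.read_csv("log_standard_4_22_to_5_08.csv")

# Click-through rate under random exposure vs. the deployed policy.
print("CTR (random exposure):", random_log["is_click"].mean())
print("CTR (standard policy):", standard_log["is_click"].mean())
```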
1. Reasons to use KuaiRand-27K or KuaiRand-1K:
- Your research needs rigorous sequential logs, e.g. for off-policy evaluation (OPE), reinforcement learning (RL), or long sequential recommendation.
2. Reasons to use KuaiRand-Pure:
- The sequential information is not necessary for your research, or you are OK with incomplete sequential logs. For example, you are studying debiasing in collaborative filtering models or multi-task modeling in recommendation.
- Your model can only run with small-size data.
Chongming Gao et al, 2022. KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos
Compared with other datasets with random exposure, KuaiRand has the following advantages:
✅ It is the first sequential recommendation dataset with millions of intervened interactions of randomly exposed items inserted in the standard recommendation feeds.
✅ It has the most comprehensive side information including explicit user IDs, interaction timestamps, and rich features for users and items.
✅ It has 15 policies with each catered for a special recommendation scenario in the Kuaishou App.
✅ It provides 12 feedback signals (e.g., click, like, and view time) for each interaction to describe the user’s comprehensive feedback.
✅ Each user has thousands of historical interactions on average.
✅ It has three versions to support various research directions in recommendation.
Recommender systems suffer from various biases in the data collection stage. Most existing datasets are very sparse and affected by user-selection bias or exposure bias. It is of critical importance to develop models that can alleviate biases. To evaluate the models, we need reliable unbiased data. KuaiRand is the first dataset that inserts the random items into the normal recommendation feeds with rich side information and all item/user IDs provided. With this authentic unbiased data, we can evaluate and thus improve the recommender policy.
KuaiRand can further support the following promising research directions in recommendation.
- Off-policy Evaluation (OPE)
- Interactive Recommendation
- Long Sequential Behavior Modeling
- Multi-Task Learning
- Bias and Debias in Recommender System: A Survey and Future Directions
- "https://arxiv.org/pdf/2308.01118.pdf">A Survey on Popularity Bias in Recommender Systems