43 datasets found
  1. CA-SUM pretrained models

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 20, 2022
    Cite
    Balaouras, Georgios (2022). CA-SUM pretrained models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6562991
    Explore at:
    Dataset updated
    May 20, 2022
    Dataset provided by
    Patras, Ioannis
    Balaouras, Georgios
    Mezaris, Vasileios
    Apostolidis, Evlampios
    Description

    This dataset contains pretrained models of the CA-SUM network architecture for video summarization, which is presented in our work titled “Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames”, in Proc. ACM ICMR 2022.

    Method overview:

    In our ICMR 2022 paper we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised video summarization approaches, which relate to the unstable training of Generator-Discriminator architectures, the use of RNNs for modeling long-range frame dependencies, and the limited ability to parallelize the training of RNN-based network architectures, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames' dependencies based on global attention, our method integrates a concentrated attention mechanism that focuses on non-overlapping blocks in the main diagonal of the attention matrix, and enriches this information by extracting and exploiting knowledge about the uniqueness and diversity of the associated video frames. In this way, our method makes better estimates about the significance of different parts of the video and drastically reduces the number of learnable parameters. Experimental evaluations on two benchmark datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches and demonstrate its ability to produce video summaries that are very close to human preferences. An ablation study focusing on the introduced components, namely the use of concentrated attention in combination with attention-based estimates of the frames' uniqueness and diversity, shows their relative contributions to the overall summarization performance.

    File format:

    The “pretrained_models.zip” file provided on this Zenodo page contains a set of pretrained models of the CA-SUM network architecture. After downloading and unpacking this file, the created “pretrained_models” folder contains two sub-directories, one for each of the benchmark datasets (SumMe and TVSum) used in our experimental evaluations. Within each of these sub-directories we provide the pretrained model (.pt file) for each data split (split0-split4), where the name of the provided .pt file indicates the training epoch and the value of the length regularization factor of the selected pretrained model.

    The models have been trained in a full-batch mode (i.e., batch size is equal to the number of training samples) and were automatically selected after the end of the training process, based on a methodology that relies on transductive inference (described in Section 4.2 of [1]). Finally, the data-splits we used for performing inference on the provided pretrained models, and the source code that can be used for training your own models of the proposed CA-SUM network architecture, can be found at: https://github.com/e-apostolidis/CA-SUM.
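
    As a quick orientation for users of these checkpoints, the short Python sketch below loads one of the provided .pt files with PyTorch and lists its parameter names. The file name shown is only an illustrative placeholder (actual names encode the training epoch and the length regularization factor), and instantiating the full model requires the CA-SUM code from the GitHub repository linked above.

        import torch

        # Path is a placeholder; pick any of the provided .pt files
        # (the real names encode epoch and regularization factor).
        # Assumes the checkpoint stores a plain state_dict of tensors.
        ckpt_path = "pretrained_models/TVSum/split0/epoch-199_reg-0.6.pt"

        state_dict = torch.load(ckpt_path, map_location="cpu")
        print(f"{len(state_dict)} tensors stored")
        for name in list(state_dict)[:5]:
            print(name, tuple(state_dict[name].shape))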

    License and Citation:

    These resources are provided for academic, non-commercial use only. If you find these resources useful in your work, please cite the following publication where they are introduced:

    E. Apostolidis, G. Balaouras, V. Mezaris, and I. Patras. 2022. “Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames”. In Proc. of the 2022 Int. Conf. on Multimedia Retrieval (ICMR ’22), June 2022, Newark, NJ, USA. https://doi.org/10.1145/3512527.3531404. Software available at: https://github.com/e-apostolidis/CA-SUM

  2. Data from: Dataset construction method of cross-lingual summarization based...

    • zenodo.org
    Updated Mar 4, 2023
    Cite
    Hangyu Pan; Yaoyi Xi; Ling Wang; Yu Nan; Zhizhong Su; Rong Cao (2023). Dataset construction method of cross-lingual summarization based on filtering and text augmentation [Dataset]. http://doi.org/10.5281/zenodo.7694044
    Explore at:
    Dataset updated
    Mar 4, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hangyu Pan; Yaoyi Xi; Ling Wang; Yu Nan; Zhizhong Su; Rong Cao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The NCLS dataset is provided by its authors (Zhu et al.): https://drive.google.com/file/d/1GZpKkHnTH_1Wxiti0BrrxPm18y9rTQRL/view. We work on the train set, validation set, and manually corrected test set.

  3. Data from: Trends in anesthesiology research: a machine learning approach to...

    • data.niaid.nih.gov
    Updated May 28, 2022
    Cite
    Miotto, Riccardo (2022). Data from: Trends in anesthesiology research: a machine learning approach to theme discovery and summarization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4989411
    Explore at:
    Dataset updated
    May 28, 2022
    Dataset provided by
    Rusanov, Alexander
    Weng, Chunhua
    Miotto, Riccardo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Objectives: Traditionally, summarization of research themes and trends within a given discipline was accomplished by manual review of scientific works in the field. However, with the ushering in of the age of "big data", new methods for discovering such information become necessary as traditional techniques grow increasingly difficult to apply due to the exponential growth of document repositories. Our objectives are to develop a pipeline for unsupervised theme extraction and summarization of thematic trends in document repositories, and to test it by applying it to a specific domain. Methods: To that end, we detail a pipeline which utilizes machine learning and natural language processing for unsupervised theme extraction, a novel method for summarization of thematic trends, and network mapping for visualization of thematic relations. We then apply this pipeline to a collection of anesthesiology abstracts. Results: We demonstrate how this pipeline enables discovery of major themes and temporal trends in anesthesiology research and facilitates document classification and corpus exploration. Discussion: The relation of prevalent topics and extracted trends to recent events in both anesthesiology and healthcare in general demonstrates the pipeline's utility. Furthermore, the agreement between the unsupervised thematic grouping and human-assigned classification validates the pipeline's accuracy and demonstrates another potential use. Conclusion: The described pipeline enables summarization and exploration of large document repositories, facilitates classification, and aids in trend identification. A more robust and user-friendly interface will facilitate the expansion of this methodology to other domains. This will be the focus of future work for our group.
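
    As a rough illustration of the kind of unsupervised theme extraction described above (not the authors' exact pipeline), a topic model such as LDA can be fit to a collection of abstracts with scikit-learn; the two-document corpus below is only a placeholder.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        # Placeholder corpus; in practice this would be the anesthesiology abstracts.
        abstracts = [
            "postoperative pain management after general anesthesia",
            "regional nerve block techniques for analgesia",
        ]

        counts = CountVectorizer(stop_words="english").fit_transform(abstracts)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
        doc_topics = lda.transform(counts)  # per-document topic mixtures, usable for trend tracking
        print(doc_topics.round(2))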

  4. Semantic Summarization for Context Aware Manipulation of Data, Phase II

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Semantic Summarization for Context Aware Manipulation of Data, Phase II [Dataset]. https://data.nasa.gov/d/vcqh-dx6v
    Explore at:
    Available download formats: csv, xml, tsv, application/rssxml, json, application/rdfxml
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    NASA's exploration and scientific missions will produce terabytes of information. As NASA enters a new phase of space exploration, managing large amounts of scientific and operational data will become even more challenging. Robots conducting planetary exploration will produce data for selection and preparation of exploration sites. Robots and space probes will collect scientific data to improve understanding of the solar system. Satellites in low Earth orbit will collect data for monitoring changes in the Earth's atmosphere and surface environment. Key challenges for all these missions are understanding and summarizing what data have been collected and using this knowledge to improve data access. TRACLabs and CMU propose to develop context aware image manipulation software for managing data collected remotely during NASA missions. This software will filter and search large image archives using the temporal and spatial characteristics of images, and the robotic, instrument, and environmental conditions when images were taken. It also will implement techniques for finding which images show a terrain feature specified by the user. In Phase II we will implement this software and evaluate its effectiveness for NASA missions. At the end of Phase II, context aware image manipulation software at TRL 5-6 will be delivered to NASA.

  5. TVSum (TVSum: Summarizing web videos using titles)

    • opendatalab.com
    zip
    Updated Mar 17, 2023
    Cite
    Yahoo Research (2023). TVSum (TVSum: Summarizing web videos using titles) [Dataset]. https://opendatalab.com/OpenDataLab/TVSum
    Explore at:
    Available download formats: zip (697042813 bytes)
    Dataset updated
    Mar 17, 2023
    Dataset provided by
    Yahoo Research
    欧特巴 (https://tw.yahoo.com/)
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    The Title-based Video Summarization (TVSum) dataset serves as a benchmark to validate video summarization techniques. It contains 50 videos of various genres (e.g., news, how-to, documentary, vlog, egocentric) and 1,000 annotations of shot-level importance scores obtained via crowdsourcing (20 per video). The video and annotation data permit automatic evaluation of various video summarization techniques without having to conduct (expensive) user studies.
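
    As an illustration of how the 20 crowdsourced score vectors per video can be averaged into a single shot-level importance curve, the Python sketch below assumes the annotations are distributed as a tab-separated file with one row per annotator (video id, category, comma-separated scores); the file name and layout are assumptions about the common TVSum release format, so check the actual archive contents.

        import csv
        from collections import defaultdict

        per_video = defaultdict(list)
        with open("ydata-tvsum50-anno.tsv") as f:          # assumed file name
            for video_id, _category, score_str in csv.reader(f, delimiter="\t"):
                per_video[video_id].append([int(s) for s in score_str.split(",")])

        # Average the annotators' scores position-by-position for each video.
        avg_scores = {
            vid: [sum(col) / len(col) for col in zip(*rows)]
            for vid, rows in per_video.items()
        }
        print(len(avg_scores), "videos averaged")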

  6. TestDescriber: First Release

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    apanichella (2020). TestDescriber: First Release [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_45120
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    inventitech
    panichella
    apanichella
    azaidman
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Replication Package for the paper entitled: "The Impact of Test Case Summaries on Bug Fixing Performance: An Empirical Investigation"

    Abstract

    Automated test generation tools have been widely investigated with the goal of reducing the cost of testing activities. However, generated tests have been shown not to help developers in detecting and finding more bugs even though they reach higher structural coverage compared to manual testing. The main reason is that generated tests are difficult to understand and maintain. Our paper proposes an approach, coined TestScribe, which automatically generates test case summaries of the portion of code exercised by each individual test, thereby improving understandability. We argue that this approach can complement the current techniques around automated unit test generation or search-based techniques designed to generate a possibly minimal set of test cases. In evaluating our approach we found that (1) developers find twice as many bugs, and (2) test case summaries significantly improve the comprehensibility of test cases, which are considered particularly useful by developers.

    This repository provides the replication package with (i) the material and working datasets of our study, (ii) the complete results of the survey, and (iii) the raw data for replication purposes and to support future studies. A detailed description of the contents is included in README.txt.

  7. Data from: State-of-the-art report summarizing techniques to determine...

    • cloud.csiss.gmu.edu
    • data.wu.ac.at
    html
    Updated Aug 8, 2019
    Cite
    Energy Data Exchange (2019). State-of-the-art report summarizing techniques to determine residual oil saturation and recommendations on the requirements for residual oil saturation research and development [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/state-of-the-art-report-summarizing-techniques-to-determine-residual-oil-saturation-and-recomme
    Explore at:
    Available download formats: html
    Dataset updated
    Aug 8, 2019
    Dataset provided by
    Energy Data Exchange
    Description

    An investigation was conducted on the residual oil saturation (ROS) measurement techniques developed during the last fifteen years. Knowledge of precise ROS measurements is required for EOR project planning. The advantages, limitations, and problems of each one of the techniques are presented in tabulated form. Also, some of the possible improvements in the measurement techniques for the residual oil saturation are summarized. The following residual oil saturation techniques are discussed: core analyses, well logging, backflow tracer tests, material balance and well testing, newly developed gravity log methods, and interwell residual oil saturation measurements. Several aspects left to be improved in both instrumentations and data interpretation on pressure coring, back-flow tracer tests, well logging, material balance calculations, well testing, and interwell ROS measurements are presented. A nuclear magnetism log-inject-log method is proposed in which the need for porosity measurement for determining residual oil saturation is eliminated. 91 refs., 3 tabs.

  8. Data from: Model Interpretation Through Lower-Dimensional Posterior...

    • tandf.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Spencer Woody; Carlos M. Carvalho; Jared S. Murray (2023). Model Interpretation Through Lower-Dimensional Posterior Summarization [Dataset]. http://doi.org/10.6084/m9.figshare.12844520.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Spencer Woody; Carlos M. Carvalho; Jared S. Murray
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Nonparametric regression models have recently surged in their power and popularity, accompanying the trend of increasing dataset size and complexity. While these models have proven their predictive ability in empirical settings, they are often difficult to interpret and do not address the underlying inferential goals of the analyst or decision maker. In this article, we propose a modular two-stage approach for creating parsimonious, interpretable summaries of complex models which allow freedom in the choice of modeling technique and the inferential target. In the first stage, a flexible model is fit which is believed to be as accurate as possible. In the second stage, lower-dimensional summaries are constructed by projecting draws from the distribution onto simpler structures. These summaries naturally come with valid Bayesian uncertainty estimates. Further, since we use the data only once to move from prior to posterior, these uncertainty estimates remain valid across multiple summaries and after iteratively refining a summary. We apply our method and demonstrate its strengths across a range of simulated and real datasets. The methods we present here are implemented in an R package available at github.com/spencerwoody/possum. Supplementary materials for this article are available online.

  9. 5.12 Cybersecurity (summary) - Archived

    • catalog.data.gov
    • performance.tempe.gov
    • +5 more
    Updated Jan 17, 2025
    + more versions
    Cite
    City of Tempe (2025). 5.12 Cybersecurity (summary) - Archived [Dataset]. https://catalog.data.gov/dataset/5-12-cybersecurity-summary-823d7
    Explore at:
    Dataset updated
    Jan 17, 2025
    Dataset provided by
    City of Tempe
    Description

    The National Institute of Standards and Technology (NIST) provides a Cybersecurity Framework (CSF) for benchmarking and measuring the maturity level of cyber security programs across all industries. The City uses this framework and toolset to measure and report on its internal cyber security program.

    The foundation for this measure is the Framework Core, a set of cybersecurity activities, desired outcomes and applicable references that are common across critical infrastructure/industry sectors. These activities come from the NIST Cybersecurity Framework (CSF) published standard, along with the information security and customer privacy controls it references (NIST 800 Series Special Publications). The Framework Core presents industry standards, guidelines, and practices in a manner that allows for communication of cybersecurity activities and outcomes across the organization, from the executive level to the implementation/operations level. The Framework Core consists of five concurrent and continuous functions: identify, protect, detect, respond, and recover. When considered together, these functions provide a high-level, strategic view of the lifecycle of an organization's management of cybersecurity risk. The Framework Core identifies underlying key categories and subcategories for each function, and matches them with example references, such as existing standards, guidelines and practices for each subcategory.

    This page provides data for the Cybersecurity performance measure: the Cybersecurity Framework cumulative score summary per fiscal year quarter (Performance Measure 5.12). The performance measure page is available at 5.12 Cybersecurity.

    Additional Information
    Source: Maturity assessment / https://www.nist.gov/topics/cybersecurity
    Contact: Scott Campbell
    Contact E-Mail: Scott_Campbell@tempe.gov
    Data Source Type: Excel
    Preparation Method: The data is a summary of a detailed and confidential analysis of the city's cyber security program. Maturity scores of subcategories within the NIST CSF are combined, averaged, and rolled up to a summary score for each major category.
    Publish Frequency: Annual
    Publish Method: Manual
    Data Dictionary
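
    The preparation method above (subcategory maturity scores combined, averaged, and rolled up per category) amounts to a simple grouped mean; the Python sketch below illustrates that arithmetic with made-up scores and column names, not the city's confidential data.

        import pandas as pd

        # Made-up example rows; the real assessment data are confidential.
        scores = pd.DataFrame({
            "function":    ["Identify", "Identify", "Protect", "Detect"],
            "subcategory": ["ID.AM-1",  "ID.RA-2",  "PR.AC-4", "DE.CM-1"],
            "maturity":    [3.2,         2.8,        3.5,       2.9],
        })

        # Roll subcategory scores up to one summary score per CSF function.
        summary = scores.groupby("function")["maturity"].mean().round(2)
        print(summary)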

  10. Comparison with some standard methods.

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Abdulkadir Abubakar Bichi; Ruhaidah Samsudin; Rohayanti Hassan; Layla Rasheed Abdallah Hasan; Abubakar Ado Rogo (2023). Comparison with some standard methods. [Dataset]. http://doi.org/10.1371/journal.pone.0285376.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Abdulkadir Abubakar Bichi; Ruhaidah Samsudin; Rohayanti Hassan; Layla Rasheed Abdallah Hasan; Abubakar Ado Rogo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Automatic text summarization is one of the most promising solutions to the ever-growing challenges of textual data, as it produces a shorter version of the original document with fewer bytes but the same information. Despite the advancements in automatic text summarization research, research on developing automatic text summarization methods for documents written in Hausa, a Chadic language widely spoken in West Africa by approximately 150,000,000 people as either their first or second language, is still in its early stages. This study proposes a novel graph-based extractive single-document summarization method for Hausa text by modifying the existing PageRank algorithm, using the normalized common bigram count between adjacent sentences as the initial vertex score. The proposed method is evaluated using a primarily collected Hausa summarization evaluation dataset comprising 113 Hausa news articles, with the ROUGE evaluation toolkit. The proposed approach outperformed the standard methods on the same datasets: it outperformed the TextRank method by 2.1%, LexRank by 12.3%, the centroid-based method by 19.5%, and the BM25 method by 17.4%.
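
    To make the graph-based idea concrete, the Python sketch below builds a TextRank-style sentence graph and seeds PageRank's initial vertex scores with normalized common-bigram counts between adjacent sentences; it is an illustrative approximation of the described method (with a small smoothing constant added so the example always runs), not the authors' exact implementation.

        import networkx as nx

        def bigrams(sentence):
            tokens = sentence.lower().split()
            return set(zip(tokens, tokens[1:]))

        # Placeholder sentences; a real run would use a tokenized Hausa article.
        sentences = [
            "gwamnati ta sanar da sabon shiri a yau",
            "sabon shiri a yau zai taimaka wa manoma",
            "manoma sun yi maraba da wannan shiri",
        ]
        bg = [bigrams(s) for s in sentences]

        G = nx.Graph()
        G.add_nodes_from(range(len(sentences)))
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                shared = len(bg[i] & bg[j])
                if shared:
                    G.add_edge(i, j, weight=shared)

        # Initial vertex score: common bigrams with the adjacent (next) sentence,
        # normalized; +1 smoothing keeps the start vector valid in this toy example.
        raw = {i: 1.0 + (len(bg[i] & bg[i + 1]) if i + 1 < len(bg) else 0.0)
               for i in range(len(bg))}
        total = sum(raw.values())
        nstart = {i: v / total for i, v in raw.items()}

        ranks = nx.pagerank(G, weight="weight", nstart=nstart)
        summary = [sentences[i] for i in sorted(ranks, key=ranks.get, reverse=True)[:2]]
        print(summary)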

  11. ‘3.07 AZ Merit Data (summary)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Oct 6, 2018
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2018). ‘3.07 AZ Merit Data (summary)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-3-07-az-merit-data-summary-6762/6d861231/?iid=004-634&v=presentation
    Explore at:
    Dataset updated
    Oct 6, 2018
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘3.07 AZ Merit Data (summary)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/fb4f97b2-e1f3-4397-ac7d-f0487f989272 on 11 February 2022.

    --- Dataset description provided by original source is as follows ---

    This page provides data for the 3rd Grade Reading Level Proficiency performance measure.


    The dataset includes student performance results on the English/Language Arts section of the AzMERIT from Fall 2017 and Spring 2018. Data are representative of third-grade students in public elementary schools in Tempe. This includes schools from both the Tempe Elementary and Kyrene districts. Results are by school and provide the total number of students tested, the total percentage passing, and the percentage of students scoring at each of the four levels of proficiency.


    The performance measure dashboard is available at 3.07 3rd Grade Reading Level Proficiency.


    Additional Information

    Source: Arizona Department of Education
    Contact: Ann Lynn DiDomenico
    Contact E-Mail: Ann_DiDomenico@tempe.gov
    Data Source Type: Excel/ CSV
    Preparation Method: Filters on original dataset: within the "Schools" tab, School District [select Tempe School District and Kyrene School District]; School Name [deselect Kyrene SD schools not in Tempe city limits]; Content Area [select English Language Arts]; Test Level [select Grade 3]; Subgroup/Ethnicity [select All Students]; remove irrelevant fields; add Fiscal Year
    Publish Frequency: Annually as data becomes available
    Publish Method: Manual
    Data Dictionary

    --- Original source retains full ownership of the source dataset ---

  12. Reproduction Package for the FSE 2024 Paper "EyeTrans: Merging Human and...

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Feb 20, 2024
    Cite
    Yifan Zhang (2024). Reproduction Package for the FSE 2024 Paper "EyeTrans: Merging Human and Machine Attention for Neural Code Summarization" [Dataset]. http://doi.org/10.5061/dryad.w9ghx3fx9
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Feb 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yifan Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 23, 2024
    Description

    This artifact accompanies our paper "EyeTrans: Merging Human and Machine Attention for Neural Code Summarization," which has been accepted for presentation at the ACM International Conference on the Foundations of Software Engineering (FSE) 2024.

    The artifact contains the dataset derived from a human study using eye-tracking for code comprehension, crucial for the development of the EyeTrans model. Additionally, it includes the source code related to the research questions addressed within our work.

    This includes the unprocessed data from the eye-tracking study, scripts for data processing, and the source code for the EyeTrans model, which merges human and machine attention within Transformer models. This resource is intended for researchers aiming to replicate our study, conduct further inquiry, or extend the techniques to new datasets in software engineering research.
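
    For readers new to the idea of combining gaze-derived and model-derived attention, the PyTorch sketch below shows one simple way such a blend can be expressed inside scaled dot-product attention; it is a hedged illustration only, and the EyeTrans paper's actual formulation should be taken from the source code included in the artifact.

        import torch
        import torch.nn.functional as F

        def blended_attention(q, k, v, human_attn, alpha=0.5):
            """q, k, v: (seq, dim) tensors; human_attn: (seq, seq) fixation-derived weights."""
            machine_attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
            human_attn = human_attn / human_attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
            # Convex combination keeps each row a valid attention distribution.
            attn = alpha * machine_attn + (1 - alpha) * human_attn
            return attn @ v

        seq, dim = 6, 16
        q = k = v = torch.randn(seq, dim)
        gaze = torch.rand(seq, seq)  # e.g., fixation durations spread over token pairs
        out = blended_attention(q, k, v, gaze)
        print(out.shape)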

  13. Data Analysis Process of the Study: Key Information Summary, Dimensional...

    • figshare.com
    docx
    Updated Jun 27, 2024
    Cite
    RUI LI (2024). Data Analysis Process of the Study: Key Information Summary, Dimensional Analysis, and Content Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.26012824.v2
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 27, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    RUI LI
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The data analysis process of this study involved three stages: key information summary, dimensional analysis, and content analysis.

    Key Information Summary: This phase focused on summarizing the key information of the literature. The content summary method was utilized to distill the main content, research themes, findings, and contributions of the literature, laying the foundation for subsequent in-depth analysis. Tables 3, 4, 5, 6, 7, and 8 were formed during this stage. Specific details of the tables are provided in the attachment.

    Dimensional Analysis: In this stage, thematic analysis was used to extract and identify key concepts and themes from the literature, exploring the relationship between self-directed learning (SDL) and deep learning. Tables 9 and 10 were formed during this stage.

    Content Analysis: The focus of this stage was on exploring how students experience and achieve deep learning within self-directed learning. Content analysis aimed to deeply understand students' learning experiences, the learning strategies they adopt, and how these strategies facilitate deep learning. Tables 11 and 12 were formed during this stage.

    Attachments: Tables 3, 4, 5, 6, 7, 8, 9, 11, and 12 are attached to the project for reference and further analysis.

  14. Table_2_Deriving comprehensive literature trends on multi-omics analysis...

    • frontiersin.figshare.com
    xlsx
    Updated Nov 12, 2024
    + more versions
    Cite
    Dattatray Mongad; Indhupriya Subramanian; Anamika Krishanpal (2024). Table_2_Deriving comprehensive literature trends on multi-omics analysis studies in autism spectrum disorder using literature mining pipeline.XLSX [Dataset]. http://doi.org/10.3389/fnins.2024.1400412.s004
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Frontiers
    Authors
    Dattatray Mongad; Indhupriya Subramanian; Anamika Krishanpal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Autism spectrum disorder (ASD) is characterized by highly heterogeneous abnormalities in functional brain connectivity affecting social behavior. There has been significant progress in understanding the molecular and genetic basis of ASD in the last decade using multi-omics approaches. Mining this large volume of biomedical literature for insights requires a considerable amount of manual curation. The machine learning and artificial intelligence fields are advancing toward simplifying data mining from unstructured text data. Here, we demonstrate our literature mining pipeline to accelerate data to insights. Using topic modeling and generative AI techniques, we present a pipeline that can classify scientific literature into thematic clusters and can help in a wide array of applications such as knowledgebase creation, conversational virtual assistants, and summarization. Employing our pipeline, we explored the ASD literature, specifically around multi-omics studies, to understand the molecular interplay underlying the autism brain.

  15. Manufacturing: Summary Series: General Summary: Method of Inventory...

    • catalog.data.gov
    Updated Sep 7, 2023
    + more versions
    Cite
    U.S. Census Bureau (2023). Manufacturing: Summary Series: General Summary: Method of Inventory Valuation by Subsector and Industries: 2012 [Dataset]. https://catalog.data.gov/dataset/manufacturing-summary-series-general-summary-method-of-inventory-valuation-by-subsector-an
    Explore at:
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    U.S. Census Bureau
    Description

    Manufacturing: Summary Series: General Summary: Method of Inventory Valuation by Subsector and Industries: 2012.

  16. Data from: Data_Sheet_1_An Active Data Representation of Videos for...

    • figshare.com
    pdf
    Updated Mar 6, 2020
    Cite
    Fasih Haider; Maria Koutsombogera; Owen Conlan; Carl Vogel; Nick Campbell; Saturnino Luz (2020). Data_Sheet_1_An Active Data Representation of Videos for Automatic Scoring of Oral Presentation Delivery Skills and Feedback Generation.PDF [Dataset]. http://doi.org/10.3389/fcomp.2020.00001.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Mar 6, 2020
    Dataset provided by
    Frontiers
    Authors
    Fasih Haider; Maria Koutsombogera; Owen Conlan; Carl Vogel; Nick Campbell; Saturnino Luz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Public speaking is an important skill, the acquisition of which requires dedicated and time-consuming training. In recent years, researchers have started to investigate automatic methods to support public speaking skills training. These methods include assessment of the trainee's oral presentation delivery skills, which may be accomplished through automatic understanding and processing of social and behavioral cues displayed by the presenter. In this study, we propose an automatic scoring system for presentation delivery skills that uses a novel active data representation method to automatically rate segments of a full video presentation. While most approaches have employed a two-step strategy consisting of detecting multiple events followed by classification, which involves annotating data to build the different event detectors and generating a data representation based on their output for classification, our method does not require event detectors. The proposed data representation is generated in an unsupervised manner using low-level audiovisual descriptors and self-organizing maps, and is used for video classification. This representation is also used to analyse video segments within a full video presentation in terms of several characteristics of the presenter's performance. The audio representation provides the best prediction results for self-confidence and enthusiasm, posture and body language, structure and connection of ideas, and overall presentation delivery. The video data representation provides the best results for presentation of relevant information with good pronunciation, usage of language according to audience, and maintenance of adequate voice volume for the audience. The fusion of audio and video data provides the best results for eye contact. Applications of the method to the provision of feedback to teachers and trainees are discussed.

  17. Smart Contract Code Summarization Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 26, 2021
    Cite
    Zhen Yang (2021). Smart Contract Code Summarization Dataset [Dataset]. http://doi.org/10.5281/zenodo.4587089
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 26, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Zhen Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The paper has been accepted by ICPC '21.

    If you find this dataset useful, please cite our paper: https://arxiv.org/abs/2103.07164

    The whole data includes:

    (1) contracts: 347,410 smart contracts

    (2) dataset:

    a. dictionaries: the dictionary of each sequence.

    b. token_idx: each input translated to numeric token indices.

    c. dataset.pkl: 317,680 (SBT sequence, node sequence, adjacency matrix, comment) tuples.
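
    A minimal Python sketch for reading the tuples, assuming dataset.pkl is a standard pickle of the (SBT sequence, node sequence, adjacency matrix, comment) tuples listed above (the path is illustrative):

        import pickle

        # Path is illustrative; adjust to where the archive was unpacked.
        with open("dataset/dataset.pkl", "rb") as f:
            samples = pickle.load(f)

        print(len(samples), "tuples loaded")            # expected: 317,680
        sbt_seq, node_seq, adjacency, comment = samples[0]
        print(comment)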

  18. “DelRiv 24k – NE”: Natural Environment Related Data Summaries for the...

    • catalog.data.gov
    • s.cnmilf.com
    • +1 more
    Updated Feb 22, 2025
    Cite
    U.S. Geological Survey (2025). “DelRiv 24k – NE”: Natural Environment Related Data Summaries for the Delaware River Watershed Within NHD Plus HR catchments [Dataset]. https://catalog.data.gov/dataset/delriv-24k-ne-natural-environment-related-data-summaries-for-the-delaware-river-watershed-
    Explore at:
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Delaware River
    Description

    These tabular data are the summarization of natural environment related variables within catchments of the Delaware River watershed at the 1:24,000 scale using the xstrm methodology. Variables being counted as natural environment related include soils/geology, lithology, elevation, slope, stream gradient, landform (geomorphon) and others. Outputs include tabular comma-separated values files (CSVs) and parquet files for the local catchment and network summaries linked to the National Hydrography Dataset Plus High-Resolution (NHDPlus HR) catchments by NHDPlus ID. Local catchments are defined as the single catchment within which the data are summarized. Network summaries are summaries for each of the local catchments and their respective network-connected upstream catchments for select variables. The summarized data tables are structured as a single column representing the catchment id values (i.e. NHDPlus ID) and the remaining columns consisting of the summarized variables. Xstrm downstream network summaries are not present within this dataset as no summaries were conducted using that network summary method. For a full description of the variables included within these summaries see xstrm_nhdhr_natural_delaware_river_datadictionary.csv in the attached files. The xstrm local summary methodology takes either raster or point data as input then summarizes those data by "zones", in this case the NHDPlus HR catchments. The network summaries then take the results from the local summaries and calculates the desired network summary statistic for the local catchment and its respective upstream or downstream catchments. As a note concerning use of these data, any rasters summarized within this process only had their cells included within a catchment if the center of the raster cell fell within the catchment boundary. However, the resolution of the input raster data for these summaries was considered to provide completely adequate coverage of the summary catchments using this option. If a confirmed complete coverage of a catchment is desired (even if a raster cell only is minimally included within the catchment) then it is recommended to rerun the xstrm summary process with the "All Touched" option set to “True”. Further information on the Xstrm summary process can be found at the Xstrm software release pages: Xstrm: Wieferich, D.J., Williams, B., Falgout, J.T., Foks, N.L. 2021. xstrm. U.S. Geological Survey software release. https://doi.org/10.5066/P9P8P7Z0. Xstrm Local: Wieferich, D.J., Gressler B., Krause K., Wieczorek M., McDonald, S. 2022. xstrm_local Version-1.1.0. U.S. Geological Survey software release. https://doi.org/10.5066/P98BOGI9.
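
    Because both the local and network summaries are keyed on the NHDPlus ID, they can be joined directly; the Python sketch below uses placeholder file and column names (consult xstrm_nhdhr_natural_delaware_river_datadictionary.csv for the actual ones).

        import pandas as pd  # reading parquet also requires pyarrow or fastparquet

        # File and column names are placeholders for this release's actual outputs.
        local = pd.read_parquet("delriv_24k_ne_local_summaries.parquet")
        network = pd.read_parquet("delriv_24k_ne_network_summaries.parquet")

        # Join local-catchment and network (upstream-accumulated) variables by catchment id.
        merged = local.merge(network, on="NHDPlusID", suffixes=("_local", "_network"))
        print(merged.head())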

  19. Table_1_Deriving comprehensive literature trends on multi-omics analysis...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Nov 12, 2024
    + more versions
    Cite
    Dattatray Mongad; Indhupriya Subramanian; Anamika Krishanpal (2024). Table_1_Deriving comprehensive literature trends on multi-omics analysis studies in autism spectrum disorder using literature mining pipeline.XLSX [Dataset]. http://doi.org/10.3389/fnins.2024.1400412.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Frontiers
    Authors
    Dattatray Mongad; Indhupriya Subramanian; Anamika Krishanpal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Autism spectrum disorder (ASD) is characterized by highly heterogeneous abnormalities in functional brain connectivity affecting social behavior. There has been significant progress in understanding the molecular and genetic basis of ASD in the last decade using multi-omics approaches. Mining this large volume of biomedical literature for insights requires a considerable amount of manual curation. The machine learning and artificial intelligence fields are advancing toward simplifying data mining from unstructured text data. Here, we demonstrate our literature mining pipeline to accelerate data to insights. Using topic modeling and generative AI techniques, we present a pipeline that can classify scientific literature into thematic clusters and can help in a wide array of applications such as knowledgebase creation, conversational virtual assistants, and summarization. Employing our pipeline, we explored the ASD literature, specifically around multi-omics studies, to understand the molecular interplay underlying the autism brain.

  20. Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene...

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    + more versions
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_1_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
