100+ datasets found
  1. Data from: Current and projected research data storage needs of Agricultural...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Current and projected research data storage needs of Agricultural Research Service researchers in 2016 [Dataset]. https://catalog.data.gov/dataset/current-and-projected-research-data-storage-needs-of-agricultural-research-service-researc-f33da
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey, which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016, we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate response to a data management expert in their unit, to all members of their unit, or to collate responses from their unit themselves before reporting in the survey.
    Larger storage ranges cover vastly different amounts of data so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.

    Resources in this dataset:

    Resource Title: Appendix A: ARS data storage survey questions.
    File Name: Appendix A.pdf
    Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop down not shown here.
    Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/

    Resource Title: CSV of Responses from ARS Researcher Data Storage Survey.
    File Name: Machine-readable survey response data.csv
    Resource Description: CSV file includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This information is the same data as in the Excel spreadsheet (also provided).

    Resource Title: Responses from ARS Researcher Data Storage Survey.
    File Name: Data Storage Survey Data for public release.xlsx
    Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.
    Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
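
    The per-person calculation described above can be sketched as follows. This is only an illustration: the range labels and their upper bounds in terabytes are hypothetical placeholders rather than the survey's actual answer options, and it does not cover the Big Data users, for whom actual or estimated values were used.

```python
# Hypothetical upper bounds (in TB) for reported storage ranges;
# the real survey's answer options may differ.
RANGE_HIGH_END_TB = {
    "less than 1 TB": 1.0,
    "1 to 10 TB": 10.0,
    "more than 10 to 100 TB": 100.0,
}


def per_person_storage_tb(range_label: str, group_size: int = 1) -> float:
    """Return the high end of the reported range divided by the number of people.

    group_size is 1 for an individual response, or G for a group response.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return RANGE_HIGH_END_TB[range_label] / group_size


individual = per_person_storage_tb("1 to 10 TB")            # 10 TB / 1
group = per_person_storage_tb("more than 10 to 100 TB", 8)  # 100 TB / 8
print(individual, group)
```

    Using the high end of each range gives a conservative (upper) estimate, which is why the wide upper ranges needed the follow-up described above.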

  2. Parameterizing Spatial Models of Infectious Disease Transmission that...

    • plos.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Rajat Malik; Rob Deardon; Grace P. S. Kwong (2023). Parameterizing Spatial Models of Infectious Disease Transmission that Incorporate Infection Time Uncertainty Using Sampling-Based Likelihood Approximations [Dataset]. http://doi.org/10.1371/journal.pone.0146253
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rajat Malik; Rob Deardon; Grace P. S. Kwong
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A class of discrete-time models of infectious disease spread, referred to as individual-level models (ILMs), are typically fitted in a Bayesian Markov chain Monte Carlo (MCMC) framework. These models quantify probabilistic outcomes regarding the risk of infection of susceptible individuals due to various susceptibility and transmissibility factors, including their spatial distance from infectious individuals. The infectious pressure from infected individuals exerted on susceptible individuals is intrinsic to these ILMs. Unfortunately, quantifying this infectious pressure for data sets containing many individuals can be computationally burdensome, leading to a time-consuming likelihood calculation and, thus, computationally prohibitive MCMC-based analysis. This problem worsens when using data augmentation to allow for uncertainty in infection times. In this paper, we develop sampling methods that can be used to calculate a fast, approximate likelihood when fitting such disease models. A simple random sampling approach is initially considered followed by various spatially-stratified schemes. We test and compare the performance of our methods with both simulated data and data from the 2001 foot-and-mouth disease (FMD) epidemic in the U.K. Our results indicate that substantial computation savings can be obtained—albeit, of course, with some information loss—suggesting that such techniques may be of use in the analysis of very large epidemic data sets.
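
    To make the simple random sampling idea concrete, here is a minimal sketch. It assumes a power-law distance kernel d^(-beta) and an infection probability of 1 - exp(-alpha * pressure); these functional forms, and every name and parameter below, are illustrative assumptions rather than the authors' exact model or code. The exact infectious pressure sums the kernel over all infectious individuals, while the approximation sums over a random sample and rescales by n / m.

```python
import math
import random

Point = tuple[float, float]


def kernel(a: Point, b: Point, beta: float) -> float:
    """Assumed power-law distance kernel d^(-beta); points must be distinct."""
    return math.dist(a, b) ** (-beta)


def exact_pressure(susceptible: Point, infectious: list[Point], beta: float) -> float:
    """Sum the kernel over every infectious individual."""
    return sum(kernel(susceptible, j, beta) for j in infectious)


def sampled_pressure(
    susceptible: Point,
    infectious: list[Point],
    beta: float,
    sample_size: int,
    rng: random.Random,
) -> float:
    """Approximate the sum from a simple random sample, rescaled by n / m."""
    n = len(infectious)
    m = min(sample_size, n)
    if m == 0:
        return 0.0
    sample = rng.sample(infectious, m)
    return (n / m) * sum(kernel(susceptible, j, beta) for j in sample)


def infection_probability(pressure: float, alpha: float) -> float:
    """Assumed form: probability of infection given the infectious pressure."""
    return 1.0 - math.exp(-alpha * pressure)


rng = random.Random(0)
# Infectious individuals placed away from the susceptible one at the origin.
infectious = [(rng.uniform(1, 10), rng.uniform(1, 10)) for _ in range(1000)]
target = (0.0, 0.0)
exact = exact_pressure(target, infectious, beta=2.0)
approx = sampled_pressure(target, infectious, beta=2.0, sample_size=100, rng=rng)
print(exact, approx)
```

    The sampled version evaluates the kernel 100 times instead of 1,000 per susceptible individual, which is where the computational savings come from; the price is sampling noise in the likelihood.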

  3. the-stack

    • huggingface.co
    • opendatalab.com
    Updated Oct 27, 2022
    Cite
    BigCode (2022). the-stack [Dataset]. https://huggingface.co/datasets/bigcode/the-stack
    Explore at:
    Dataset updated
    Oct 27, 2022
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack

      Changelog
    

    Release Description

    v1.0 Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3TB in size.

    v1.1 The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming languages… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack.

  4. Data from: An Open-set Recognition and Few-Shot Learning Dataset for Audio...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 21, 2024
    Cite
    Pedro Zuccarello (2024). An Open-set Recognition and Few-Shot Learning Dataset for Audio Event Classification in Domestic Environments [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3689287
    Explore at:
    Dataset updated
    May 21, 2024
    Dataset provided by
    Sergi Perez-Castanos
    Maximo Cobos
    Javier Naranjo-Alcazar
    Pedro Zuccarello
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The problem of training a deep neural network with a small set of positive samples is known as few-shot learning (FSL). It is widely known that traditional deep learning (DL) algorithms usually show very good performance when trained with large datasets. However, in many applications, it is not possible to obtain such a high number of samples. In the image domain, typical FSL applications are those related to face recognition. In the audio domain, music fraud or speaker recognition can clearly benefit from FSL methods. This paper deals with the application of FSL to the detection of specific and intentional acoustic events given by different types of sound alarms, such as door bells or fire alarms, using a limited number of samples. These sounds typically occur in domestic environments where many events corresponding to a wide variety of sound classes take place. Therefore, the detection of such alarms in a practical scenario can be considered an open-set recognition (OSR) problem. To address the lack of a dedicated public dataset for audio FSL, researchers usually make modifications to other available datasets. This paper aims to provide the audio recognition community with a carefully annotated dataset for FSL and OSR comprising 1360 clips from 34 classes divided into pattern sounds and unwanted sounds. To facilitate and promote research in this area, results with two baseline systems (one trained from scratch and another based on transfer learning) are presented.

  5. Data from: MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile...

    • produccioncientifica.ucm.es
    • zenodo.org
    Updated 2024
    Cite
    Banos, Oresti; Damas, Miguel; Goicoechea, Carmen; Perakakis, Pandelis; Pomares, Hector; Rodriguez-Leon, Ciro; Sanabria, Daniel; Villalonga, Claudia (2024). MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile Dataset for Investigating Individual and Collective Well-Being [Dataset]. https://produccioncientifica.ucm.es/documentos/668fc499b9e7c03b01be2372
    Explore at:
    Dataset updated
    2024
    Authors
    Banos, Oresti; Damas, Miguel; Goicoechea, Carmen; Perakakis, Pandelis; Pomares, Hector; Rodriguez-Leon, Ciro; Sanabria, Daniel; Villalonga, Claudia
    Description

    This study engaged 409 participants over a period spanning from July 10 to August 8, 2023, ensuring representation across various demographic factors: 221 females, 186 males, 2 non-binary, year of birth between 1951 and 2005, with varied annual incomes and from 15 Spanish regions. The MobileWell400+ dataset, openly accessible, encompasses a wide array of data collected via the participants' mobile phone, including demographic, emotional, social, behavioral, and well-being data. Methodologically, the project presents a promising avenue for uncovering new social, behavioral, and emotional indicators, supplementing existing literature. Notably, artificial intelligence is considered to be instrumental in analysing these data, discerning patterns, and forecasting trends, thereby advancing our comprehension of individual and population well-being. Ethical standards were upheld, with participants providing informed consent.

    The following is a non-exhaustive list of collected data:

    Data continuously collected through the participants' smartphone sensors: physical activity (resting, walking, driving, cycling, etc.), name of detected WiFi networks, connectivity type (WiFi, mobile, none), ambient light, ambient noise, and status of the device screen (on, off, locked, unlocked).

    Data corresponding to an initial survey prompted via the smartphone, with information related to demographic data, effects and COVID vaccination, average hours of physical activity, and answers to a series of questions to measure mental health, many of them taken from internationally recognised psychological and well-being scales (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception.

    Data corresponding to daily surveys prompted via the smartphone, where variables related to mood (valence, activation, energy and emotional events) and social interaction (quantity and quality) are measured.

    Data corresponding to weekly surveys prompted via the smartphone, where information on overall health, hours of physical activity per week, loneliness, and questions related to well-being are asked.

    Data corresponding to a final survey prompted via the smartphone, consisting of similar questions to the ones asked in the initial survey, namely psychological and well-being items (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception questions.

    For a more detailed description of the study please refer to MobileWell400+StudyDescription.pdf.

    For a more detailed description of the collected data, variables and data files please refer to MobileWell400+FilesDescription.pdf.

  6. Dataset for the Article "Does the Venue of Scientific Conferences Leverage...

    • paperswithcode.com
    • opendatalab.com
    Updated May 30, 2021
    Cite
    (2021). Dataset for the Article "Does the Venue of Scientific Conferences Leverage their Impact? A Large Scale study on Computer Science Dataset [Dataset]. https://paperswithcode.com/dataset/dataset-for-the-article-does-the-venue-of
    Explore at:
    Dataset updated
    May 30, 2021
    Description

    Is there any correlation between the impact of a scientific conference and the venue where it takes place? It seems that no one has tackled this issue before, so we decided to explore the possible implications. On the one hand, we considered the number of citations as an indicator of the impact of a conference; on the other hand, we considered specific touristic indexes that characterize the venue. In this work we report on the results of the large scale analysis we conducted on the bibliographic data we extracted from nearly 4000 conference series in the Computer Science area and over 2.5 million papers spanning more than 30 years of research. Interestingly, we found that the two aspects are indeed related, as shown by the detailed analysis of the data.

  7. COVID-19 Case Surveillance Public Use Data

    • data.cdc.gov
    • paperswithcode.com
    • +5more
    application/rdfxml +5
    Updated Jul 9, 2024
    Cite
    CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data [Dataset]. https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf
    Explore at:
    Available download formats: application/rdfxml, tsv, csv, json, xml, application/rssxml
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Authors
    CDC Data, Analytics and Visualization Task Force
    License

    https://www.usa.gov/government-works

    Description

    Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

    Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

    This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors, and no geographic data.

    CDC has three COVID-19 case surveillance datasets:

    The following apply to all three datasets:

    Overview

    The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

    For more information: NNDSS Supports the COVID-19 Response | CDC.

    The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.

    COVID-19 Case Reports

    COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.

    All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.

    Data are Considered Provisional

    • The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
    • Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.
    • Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

    Data Limitations

    To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

    Data Quality Assurance Procedures

    CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:

    • Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question “Was the individual hospitalized?” where the possible answer choices include “Yes,” “No,” or “Unknown,” the blank value is recoded to Missing because the case report form did not include a response to the question.
    • Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
    • Additional data quality processing to recode free text data is ongoing. Data on symptoms, race and ethnicity, and healthcare worker status have been prioritized.
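
    The first two cleaning steps above can be illustrated with a small sketch. The field names (`hosp_yn`, `onset_dt`) and the record layout are assumptions made for illustration; this is not CDC's actual processing code.

```python
from datetime import date
from typing import Optional


def recode_blank(value: Optional[str]) -> str:
    """Recode an unanswered (blank) response to "Missing"."""
    if value is None or value.strip() == "":
        return "Missing"
    return value


def null_future_date(value: Optional[date], today: date) -> Optional[date]:
    """Set an illogical (future) date to None until it is corrected."""
    if value is not None and value > today:
        return None
    return value


# Hypothetical record: blank hospitalization answer, onset date in the future.
record = {"hosp_yn": "", "onset_dt": date(2030, 1, 1)}
cleaned = {
    "hosp_yn": recode_blank(record["hosp_yn"]),
    "onset_dt": null_future_date(record["onset_dt"], today=date(2024, 7, 9)),
}
print(cleaned)
```

    Note that "Missing" (no answer on the form) stays distinct from an explicit "Unknown" answer, which is passed through unchanged.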

    Data Suppression

    To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
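
    A rough sketch of this kind of suppression is shown below. The field names, the example values, and the exact rule (recoding all three demographic fields when their combination occurs fewer than five times) are assumptions for illustration and may differ from CDC's procedure.

```python
from collections import Counter

DEMOGRAPHIC_FIELDS = ("sex", "age_group", "race_ethnicity")  # assumed names
THRESHOLD = 5  # combinations seen fewer than 5 times are suppressed


def suppress_rare_combinations(records: list[dict]) -> list[dict]:
    """Recode rare demographic combinations to "NA" without dropping records."""
    counts = Counter(tuple(r[f] for f in DEMOGRAPHIC_FIELDS) for r in records)
    result = []
    for r in records:
        combo = tuple(r[f] for f in DEMOGRAPHIC_FIELDS)
        out = dict(r)  # copy so the input records are left untouched
        if counts[combo] < THRESHOLD:
            for f in DEMOGRAPHIC_FIELDS:
                out[f] = "NA"
        result.append(out)
    return result


common = {"sex": "Female", "age_group": "20 - 29 Years", "race_ethnicity": "Unknown"}
rare = {"sex": "Male", "age_group": "80+ Years", "race_ethnicity": "Unknown"}
records = [dict(common) for _ in range(6)] + [dict(rare)]
suppressed = suppress_rare_combinations(records)
print(len(suppressed), suppressed[-1])
```

    The output has the same number of rows as the input, matching the statement that records with data suppression are never removed.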

    For questions, please contact Ask SRRG (eocevent394@cdc.gov).

    Additional COVID-19 Data

    COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These

  8. MOBDrone: a large-scale drone-view dataset for man overboard detection

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jul 17, 2024
    Cite
    Andrea Berton (2024). MOBDrone: a large-scale drone-view dataset for man overboard detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5996889
    Explore at:
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Chiara Benvenuti
    Andrea Berton
    Fabrizio Falchi
    Lucia Vadicamo
    Donato Cafarelli
    Marco Paterni
    Claudio Gennaro
    Luca Ciampi
    Mirko Passera
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    The Man OverBoard Drone (MOBDrone) dataset is a large-scale collection of aerial footage images. It contains 126,170 frames extracted from 66 video clips gathered from one UAV flying at an altitude of 10 to 60 meters above the mean sea level. Images are manually annotated with more than 180K bounding boxes localizing objects belonging to 5 categories --- person, boat, lifebuoy, surfboard, wood. More than 113K of these bounding boxes belong to the person category and localize people in the water simulating the need to be rescued.

    In this repository, we provide:

    66 Full HD video clips (total size: 5.5 GB)

    126,170 images extracted from the videos at a rate of 30 FPS (total size: 243 GB)

    3 annotation files for the extracted images that follow the MS COCO data format (for more info see https://cocodataset.org/#format-data):

    annotations_5_custom_classes.json: this file contains annotations concerning all five categories; please note that class ids do not correspond with the ones provided by the MS COCO standard since we account for two new classes not previously considered in the MS COCO dataset --- lifebuoy and wood

    annotations_3_coco_classes.json: this file contains annotations concerning the three classes also accounted by the MS COCO dataset --- person, boat, surfboard. Class ids correspond with the ones provided by the MS COCO standard.

    annotations_person_coco_classes.json: this file contains annotations concerning only the 'person' class. Class id corresponds to the one provided by the MS COCO standard.
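
    As a minimal sketch of working with such files, the snippet below counts bounding boxes per category in a COCO-style dictionary. It relies only on the standard `categories` and `annotations` keys of the MS COCO format; the tiny in-memory example and its ids are made up and stand in for a real file such as annotations_5_custom_classes.json, which would be read with `json.load` instead.

```python
import json
from collections import Counter


def count_boxes_per_category(coco: dict) -> dict[str, int]:
    """Count annotations per category name in a COCO-style dictionary."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
    return dict(counts)


# Made-up stand-in for an annotation file; ids and boxes are illustrative only.
example = json.loads("""
{
  "images": [{"id": 1, "file_name": "frame_0001.jpg"}],
  "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "boat"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [5, 5, 20, 40]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [60, 8, 18, 35]},
    {"id": 12, "image_id": 1, "category_id": 2, "bbox": [100, 50, 80, 30]}
  ]
}
""")
box_counts = count_boxes_per_category(example)
print(box_counts)
```

    Because the name lookup goes through the file's own `categories` list, the same function works whether the class ids follow the MS COCO standard or the custom five-class numbering.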

    The MOBDrone dataset is intended as a test data benchmark. However, for researchers interested in using our data also for training purposes, we provide training and test splits:

    Test set: All the images whose filename starts with "DJI_0804" (total: 37,604 images)

    Training set: All the images whose filename starts with "DJI_0915" (total: 88,568 images)
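
    The split above can be reproduced by filtering on the filename prefix, as in this small sketch. The example filenames are hypothetical; only the two prefixes come from the description.

```python
TEST_PREFIX = "DJI_0804"
TRAIN_PREFIX = "DJI_0915"


def split_by_prefix(filenames: list[str]) -> tuple[list[str], list[str]]:
    """Return (train, test) lists based on the filename prefix."""
    train = [name for name in filenames if name.startswith(TRAIN_PREFIX)]
    test = [name for name in filenames if name.startswith(TEST_PREFIX)]
    return train, test


# Hypothetical filenames; the real naming pattern after the prefix may differ.
names = ["DJI_0804_0001.png", "DJI_0915_0001.png", "DJI_0915_0002.png"]
train, test = split_by_prefix(names)
print(len(train), len(test))
```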

    More details about data generation and the evaluation protocol can be found at our MOBDrone paper: https://arxiv.org/abs/2203.07973 The code to reproduce our results is available at this GitHub Repository: https://github.com/ciampluca/MOBDrone_eval See also http://aimh.isti.cnr.it/dataset/MOBDrone

    Citing the MOBDrone

    The MOBDrone is released under a Creative Commons Attribution license, so please cite the MOBDrone if it is used in your work in any form. Published academic papers should use the academic paper citation for our MOBDrone paper, where we evaluated several pre-trained state-of-the-art object detectors focusing on the detection of the overboard people

    @inproceedings{MOBDrone2021, title={MOBDrone: a Drone Video Dataset for Man OverBoard Rescue}, author={Donato Cafarelli and Luca Ciampi and Lucia Vadicamo and Claudio Gennaro and Andrea Berton and Marco Paterni and Chiara Benvenuti and Mirko Passera and Fabrizio Falchi}, booktitle={ICIAP2021: 21th International Conference on Image Analysis and Processing}, year={2021} }

    and this Zenodo Dataset

    @dataset{donato_cafarelli_2022_5996890, author={Donato Cafarelli and Luca Ciampi and Lucia Vadicamo and Claudio Gennaro and Andrea Berton and Marco Paterni and Chiara Benvenuti and Mirko Passera and Fabrizio Falchi}, title = {{MOBDrone: a large-scale drone-view dataset for man overboard detection}}, month = feb, year = 2022, publisher = {Zenodo}, version = {1.0.0}, doi = {10.5281/zenodo.5996890}, url = {https://doi.org/10.5281/zenodo.5996890} }

    Personal works, such as machine learning projects/blog posts, should provide a URL to the MOBDrone Zenodo page (https://doi.org/10.5281/zenodo.5996890), though a reference to our MOBDrone paper would also be appreciated.

    Contact Information

    If you would like further information about the MOBDrone or if you experience any issues downloading files, please contact us at mobdrone[at]isti.cnr.it

    Acknowledgements

    This work was partially supported by NAUSICAA - "NAUtical Safety by means of Integrated Computer-Assistance Appliances 4.0" project funded by the Tuscany region (CUP D44E20003410009). The data collection was carried out with the collaboration of the Fly&Sense Service of the CNR of Pisa - for the flight operations of remotely piloted aerial systems - and of the Institute of Clinical Physiology (IFC) of the CNR - for the water immersion operations.

  9. BanglaBook Dataset

    • paperswithcode.com
    Updated May 10, 2023
    Cite
    Mohsinul Kabir; Obayed Bin Mahfuz; Syed Rifat Raiyan; Hasan Mahmud; Md Kamrul Hasan (2023). BanglaBook Dataset [Dataset]. https://paperswithcode.com/dataset/banglabook
    Explore at:
    Dataset updated
    May 10, 2023
    Authors
    Mohsinul Kabir; Obayed Bin Mahfuz; Syed Rifat Raiyan; Hasan Mahmud; Md Kamrul Hasan
    Description

    This repository contains the code, data, and models of the paper titled "BᴀɴɢʟᴀBᴏᴏᴋ: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews" published in the Findings of the Association for Computational Linguistics: ACL 2023.

    License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

    Data Format

    Each row consists of a book review sample. The table below describes what each column signifies.

    Column Title    Description
    id              The unique identification number of the sample
    Book_Name       The title of the book that has been evaluated by the review
    Writer_Name     The name of the book's author
    Category        The genre to which the book belongs
    Rating          A numerical value $r$ such that $1 \leq r \leq 5$; a score reflecting the reviewer's subjective assessment of the book's quality
    Review          The review text written by the reviewer
    Site            The name of the online bookshop
    sentiment       The conveyed sentiment and class label of the review. For a review sample $i$ with rating $r_i$, the sentiment label $S_i$ is
                    $S_i = \begin{cases}\text{Negative}, & \text{if } r_i \leq 2\\ \text{Neutral}, & \text{if } r_i = 3\\ \text{Positive}, & \text{if } r_i \geq 4\end{cases}$
    label           The numerical representation of the sentiment label. For a review sample $i$ with sentiment label $S_i$, the numerical label is
                    $label_i = \begin{cases}0, & \text{if } S_i = \text{Negative}\\ 1, & \text{if } S_i = \text{Neutral}\\ 2, & \text{if } S_i = \text{Positive}\end{cases}$
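
    The two mappings in the table can be written as a short sketch. The function and constant names are illustrative and are not taken from the BanglaBook repository.

```python
LABELS = {"Negative": 0, "Neutral": 1, "Positive": 2}


def sentiment_from_rating(rating: int) -> str:
    """Map a rating r with 1 <= r <= 5 to its sentiment label S."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"


def label_from_rating(rating: int) -> int:
    """Map a rating to the numerical label via its sentiment."""
    return LABELS[sentiment_from_rating(rating)]


labels = [label_from_rating(r) for r in range(1, 6)]
print(labels)
```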

    Data Construction Data Collection Process For the data collection and preparation process of the BᴀɴɢʟᴀBᴏᴏᴋ dataset, we first compile a list of URLs for authors from online bookstores. From there, we procure URLs for the books. We meticulously scrape information such as book titles, author names, book categories, review texts, reviewer names, review dates, and ratings by utilizing these book URLs. https://github.com/mohsinulkabir14/BanglaBook/raw/main/images/banglabookgithub1.png" alt="drawing" style="width:1000px;"/>

    Labeling, Translation, and Validation of the Curated Samples If a review does not have a rating, we deem it unannotated. Reviews with a rating of 1 or 2 are classified as negative, a rating of 3 is considered neutral, and a rating of 4 or 5 is classified as positive. After discarding the unannotated reviews, we curate a final dataset of 158,065 annotated reviews. Of these, 89,371 are written entirely in Bangla. The remaining 68,694 reviews were written in Romanized Bangla, English, or a mix of languages. They are translated into Bangla with Google Translator and a custom Python program using the googletrans library. The translations are subsequently subjected to manual review and scrutiny to confirm their accuracy. https://github.com/mohsinulkabir14/BanglaBook/raw/main/images/banglabookgithub2.png" alt="drawing" style="width:1000px;"/>

    Results

    [Image: https://github.com/mohsinulkabir14/BanglaBook/raw/main/images/banglabookgithub3.png]

    Citation

    If you find this work useful, please cite our paper:

    @inproceedings{kabir-etal-2023-banglabook,
      title = "{B}angla{B}ook: A Large-scale {B}angla Dataset for Sentiment Analysis from Book Reviews",
      author = "Kabir, Mohsinul and Bin Mahfuz, Obayed and Raiyan, Syed Rifat and Mahmud, Hasan and Hasan, Md Kamrul",
      booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
      month = jul,
      year = "2023",
      address = "Toronto, Canada",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/2023.findings-acl.80",
      pages = "1237--1247",
      abstract = "The analysis of consumer sentiment, as expressed through reviews, can provide a wealth of insight regarding the quality of a product. While the study of sentiment analysis has been widely explored in many popular languages, relatively less attention has been given to the Bangla language, mostly due to a lack of relevant data and cross-domain adaptability. To address this limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral. We provide a detailed statistical analysis of the dataset and employ a range of machine learning models to establish baselines including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features, emphasizing the necessity for additional training resources in this domain. Additionally, we conduct an in-depth error analysis by examining sentiment unigrams, which may provide insight into common classification errors in under-resourced languages like Bangla. Our codes and data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.",
    }

  10. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions of itemsets that a customer is most likely to purchase. I was given a dataset containing a retailer's data; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and to suggest itemsets to customers, which should increase customer engagement, improve customer experience, and help identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rules are most often used to find associations between different objects in a set, for example to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support/P(computer mouse) = 0.08/0.10 = 0.80
    • lift = confidence/P(mouse mat) = 0.80/0.09 = 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
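The three rule metrics can be checked with a short Python sketch. It uses the standard definitions for a rule A => B: support = P(A and B), confidence = P(A and B)/P(A), and lift = confidence/P(B). The function name `rule_metrics` and its parameters are hypothetical, chosen only for this illustration.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Return (support, confidence, lift) for the rule antecedent => consequent.

    n_total       number of transactions (customers)
    n_antecedent  transactions containing the antecedent item
    n_consequent  transactions containing the consequent item
    n_both        transactions containing both items
    """
    support = n_both / n_total
    confidence = n_both / n_antecedent           # P(A and B) / P(A)
    lift = confidence / (n_consequent / n_total)  # confidence / P(B)
    return support, confidence, lift


# 100 customers: 10 bought a mouse, 9 bought a mat, 8 bought both.
support, confidence, lift = rule_metrics(100, 10, 9, 8)
print(round(support, 2), round(confidence, 2), round(lift, 2))
```

Lift is symmetric in the two items, so the rule "mat => mouse" has the same lift, while its confidence is 8/9 rather than 8/10.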

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    [Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png]

    Libraries in R

    First, we need to load the required libraries. Each library is briefly described below.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

    [Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png]

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    [Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png]
    [Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png]

    Next, we clean our data frame by removing missing values.

    [Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png]

    To apply association rule mining, we need to convert the data frame into transaction data, so that all items that are bought together in one invoice will be in ...
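The conversion from a row-per-item table to transaction data can be illustrated in Python. This is a sketch under stated assumptions, not the author's R workflow (which uses the arules package): the function name `to_transactions` is hypothetical, rows are assumed to be `(BillNo, Itemname)` pairs, and rows with a missing bill number or item name are assumed to be skipped.

```python
from collections import defaultdict


def to_transactions(rows):
    """Group (BillNo, Itemname) rows into one basket of items per bill.

    Rows with a missing bill number or an empty item name are skipped,
    and duplicate items within one bill are collapsed.
    """
    baskets = defaultdict(set)
    for bill_no, item in rows:
        if bill_no is None or not item:
            continue
        baskets[bill_no].add(item)
    # Sort items so the result is deterministic.
    return {bill: sorted(items) for bill, items in baskets.items()}


# Hypothetical sample rows for illustration only.
rows = [
    (536365, "MUG"),
    (536365, "LANTERN"),
    (536366, "MUG"),
    (536365, "MUG"),   # duplicate item on the same bill
    (None, "TEAPOT"),  # missing bill number
]
transactions = to_transactions(rows)
```

Each value in `transactions` is the list of distinct items on one bill, which is the shape an association rule algorithm such as Apriori expects as input.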

  11. Goodness-of-fit filtering in classical metric multidimensional scaling with...

    • tandf.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Jan Graffelman (2023). Goodness-of-fit filtering in classical metric multidimensional scaling with large datasets [Dataset]. http://doi.org/10.6084/m9.figshare.11389830.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Jan Graffelman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metric multidimensional scaling (MDS) is a widely used multivariate method with applications in almost all scientific disciplines. Eigenvalues obtained in the analysis are usually reported in order to calculate the overall goodness-of-fit of the distance matrix. In this paper, we refine MDS goodness-of-fit calculations, proposing additional point and pairwise goodness-of-fit statistics that can be used to filter poorly represented observations in MDS maps. The proposed statistics are especially relevant for large data sets that contain outliers, with typically many poorly fitted observations, and are helpful for improving MDS output and emphasizing the most important features of the dataset. Several goodness-of-fit statistics are considered, and both Euclidean and non-Euclidean distance matrices are considered. Some examples with data from demographic, genetic and geographic studies are shown.

  12. Data from: CESNET-QUIC22: A large one-month QUIC network traffic dataset...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Feb 29, 2024
    Cite
    Hynek, Karel (2024). CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7409923
    Explore at:
    Dataset updated
    Feb 29, 2024
    Dataset provided by
    Lukačovič, Andrej
    Šiška, Pavel
    Luxemburk, Jan
    Hynek, Karel
    Čejka, Tomáš
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please refer to the original data article for further data description: Jan Luxemburk et al. CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines, Data in Brief, 2023, 108888, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2023.108888.

    We recommend using the CESNET DataZoo Python library, which facilitates the work with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.

    The QUIC (Quick UDP Internet Connection) protocol has the potential to replace TLS over TCP, which is the standard choice for reliable and secure Internet communication. Due to its design that makes the inspection of QUIC handshakes challenging and its usage in HTTP/3, there is an increasing demand for research in QUIC traffic analysis. This dataset contains one month of QUIC traffic collected in an ISP backbone network, which connects 500 large institutions and serves around half a million people. The data are delivered as enriched flows that can be useful for various network monitoring tasks. The provided server names and packet-level information allow research in the encrypted traffic classification area. Moreover, included QUIC versions and user agents (smartphone, web browser, and operating system identifiers) provide information for large-scale QUIC deployment studies.

    Data capture

    The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capturing was done for four weeks between 31.10.2022 and 27.11.2022. The following list provides per-week flow count, capture period, and uncompressed size:

    • W-2022-44: Uncompressed Size: 19 GB; Capture Period: 31.10.2022 - 6.11.2022; Number of flows: 32.6M
    • W-2022-45: Uncompressed Size: 25 GB; Capture Period: 7.11.2022 - 13.11.2022; Number of flows: 42.6M
    • W-2022-46: Uncompressed Size: 20 GB; Capture Period: 14.11.2022 - 20.11.2022; Number of flows: 33.7M
    • W-2022-47: Uncompressed Size: 25 GB; Capture Period: 21.11.2022 - 27.11.2022; Number of flows: 44.1M
    • CESNET-QUIC22: Uncompressed Size: 89 GB; Capture Period: 31.10.2022 - 27.11.2022; Number of flows: 153M

    Data description

    The dataset consists of network flows describing encrypted QUIC communications. Flows were created using ipfixprobe flow exporter and are extended with packet metadata sequences, packet histograms, and with fields extracted from the QUIC Initial Packet, which is the first packet of the QUIC connection handshake. The extracted handshake fields are the Server Name Indication (SNI) domain, the used version of the QUIC protocol, and the user agent string that is available in a subset of QUIC communications.

    Packet Sequences

    Flows in the dataset are extended with sequences of packet sizes, directions, and inter-packet times. For the packet sizes, we consider payload size after transport headers (UDP headers for the QUIC case). Packet directions are encoded as ±1, +1 meaning a packet sent from client to server, and -1 a packet from server to client. Inter-packet times depend on the location of communicating hosts, their distance, and on the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate the response to be sent in the next packet. Packet metadata sequences have a length of 30, which is the default setting of the used flow exporter. We also derive three fields from each packet sequence: its length, time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction (from packet directions data); in other words, each client request and server response pair counts as one roundtrip.

    Flow statistics

    Flows also include standard flow statistics, which represent aggregated information about the entire bidirectional flow. The fields are: the number of transmitted bytes and packets in both directions, the duration of flow, and packet histograms. Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow in both directions (more information in the PHISTS plugin documentation). There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. Moreover, each flow has its end reason - either it was idle, reached the active timeout, or ended due to other reasons. This corresponds with the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field represents the forced end and lack of resources reasons. The end of flow detected reason is not considered because it is not relevant for UDP connections.

    Dataset structure

    The dataset flows are delivered in compressed CSV files. CSV files contain one flow per row; data columns are summarized in the provided list below. For each flow data file, there is a JSON file with the number of saved and seen (before sampling) flows per service and total counts of all received (observed on the CESNET2 network), service (belonging to one of the dataset's services), and saved (provided in the dataset) flows. There is also the stats-week.json file aggregating flow counts of a whole week and the stats-dataset.json file aggregating flow counts for the entire dataset. Flow counts before sampling can be used to compute sampling ratios of individual services and to resample the dataset back to the original service distribution. Moreover, various dataset statistics, such as feature distributions and value counts of QUIC versions and user agents, are provided in the dataset-statistics folder. The mapping between services and service providers is provided in the servicemap.csv file, which also includes SNI domains used for ground truth labeling. The following list describes flow data fields in CSV files:

    • ID: Unique identifier
    • SRC_IP: Source IP address
    • DST_IP: Destination IP address
    • DST_ASN: Destination Autonomous System number
    • SRC_PORT: Source port
    • DST_PORT: Destination port
    • PROTOCOL: Transport protocol
    • QUIC_VERSION: QUIC protocol version
    • QUIC_SNI: Server Name Indication domain
    • QUIC_USER_AGENT: User agent string, if available in the QUIC Initial Packet
    • TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
    • TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
    • DURATION: Duration of the flow in seconds
    • BYTES: Number of transmitted bytes from client to server
    • BYTES_REV: Number of transmitted bytes from server to client
    • PACKETS: Number of packets transmitted from client to server
    • PACKETS_REV: Number of packets transmitted from server to client
    • PPI: Packet metadata sequence in the format: [[inter-packet times], [packet directions], [packet sizes]]
    • PPI_LEN: Number of packets in the PPI sequence
    • PPI_DURATION: Duration of the PPI sequence in seconds
    • PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
    • PHIST_SRC_SIZES: Histogram of packet sizes from client to server
    • PHIST_DST_SIZES: Histogram of packet sizes from server to client
    • PHIST_SRC_IPT: Histogram of inter-packet times from client to server
    • PHIST_DST_IPT: Histogram of inter-packet times from server to client
    • APP: Web service label
    • CATEGORY: Service category
    • FLOW_ENDREASON_IDLE: Flow was terminated because it was idle
    • FLOW_ENDREASON_ACTIVE: Flow was terminated because it reached the active timeout
    • FLOW_ENDREASON_OTHER: Flow was terminated for other reasons
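The eight-bin layout of the packet histograms (the PHIST_* fields) can be illustrated with a small Python sketch. It follows only the intervals stated in the description (0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024) and is not the ipfixprobe implementation; the names `BIN_UPPER_BOUNDS`, `phist_bin`, and `histogram` are hypothetical, and assigning a non-integer value to the first bin whose upper bound it does not exceed is an assumption.

```python
# Inclusive upper bounds of the first seven bins; the eighth bin is ">1024".
BIN_UPPER_BOUNDS = [15, 31, 63, 127, 255, 511, 1024]


def phist_bin(value):
    """Return the bin index (0-7) for a packet size in bytes or an
    inter-packet time in milliseconds."""
    if value < 0:
        raise ValueError("value must be non-negative")
    for index, upper in enumerate(BIN_UPPER_BOUNDS):
        if value <= upper:
            return index
    return len(BIN_UPPER_BOUNDS)  # the ">1024" bin


def histogram(values):
    """Count how many values fall into each of the eight bins."""
    counts = [0] * (len(BIN_UPPER_BOUNDS) + 1)
    for value in values:
        counts[phist_bin(value)] += 1
    return counts
```

For instance, `histogram([10, 20, 600, 1500, 1500])` places one value in each of the first two bins, one in the 512-1024 bin, and two in the last bin.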

    Link to other CESNET datasets

    https://www.liberouter.org/technology-v2/tools-services-datasets/datasets/
    https://github.com/CESNET/cesnet-datazoo

    Please cite the original data article:

    @article{CESNETQUIC22, author = {Jan Luxemburk and Karel Hynek and Tomáš Čejka and Andrej Lukačovič and Pavel Šiška}, title = {CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines}, journal = {Data in Brief}, pages = {108888}, year = {2023}, issn = {2352-3409}, doi = {https://doi.org/10.1016/j.dib.2023.108888}, url = {https://www.sciencedirect.com/science/article/pii/S2352340923000069} }

  13. Significant Lands - Water Lines

    • data.cnra.ca.gov
    • data.ca.gov
    • +5more
    Updated Apr 8, 2025
    + more versions
    Cite
    California State Lands Commission (2025). Significant Lands - Water Lines [Dataset]. https://data.cnra.ca.gov/dataset/significant-lands-water-lines
    Explore at:
    Available download formats: arcgis geoservices rest api, geojson, html, zip, gdb, txt, gpkg, csv, xlsx, kml
    Dataset updated
    Apr 8, 2025
    Dataset authored and provided by
    California State Lands Commission (https://www.slc.ca.gov/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The State Lands Commission has prepared the Significant Lands Inventory (report) for the California Legislature as a general identification and classification of those unconveyed State school lands and tide and submerged lands which possess significant environmental values. The publication incorporates evaluated and pertinent comments received on the initial draft report which was circulated statewide in February 1975.

    The absence of a particular digitized waterway in the dataset does not mean that the State does not claim ownership of that parcel or waterway, or that such specific parcel or waterway has no significant environmental values. This dataset is not intended to establish ownership, only to identify those parcels which possess significant environmental values. Staff was unable to physically inventory all of the considered lands; instead, the advice and participation of those with known environmental expertise were used to supplement the staff survey.

    • Tide and submerged lands are digitized in the WaterBody and WaterLine feature classes; WaterLines for coastal areas, WaterBody for inland areas. Tide and submerged lands under the jurisdiction of the State Lands Commission are those sovereign lands received from the Federal Government by virtue of California's admission to the Union on an equal footing with the original States. Such lands, and State interest therein, are generally the lands waterward of the ordinary high water mark of the Pacific Ocean (seaward to a three-mile limit); tidal bays, sloughs, estuaries; and, navigable lakes and streams within the State.

    • School Lands are digitized in the SchoolLand feature class. State school lands under the jurisdiction of the Commission are largely composed of the 16th and 36th sections of each township. The Federal Government transferred these lands to the State in 1853, in order to establish a financial foundation for a public school system. In cases where the 16th and 36th sections were mineral in character, incomplete as to acreage total, or already claimed or granted by the Federal Government, the State was permitted to select other lands "in lieu" of the specific sections.

    The public trust of commerce, navigation and fisheries which the State retains on patented sovereign lands should also be considered included in this inventory. Wherever a waterway, or body of water, is listed or mapped, the common trust state interest in patented sovereign lands, if any, is also included.

    The State Lands Commission emphasized when it adopted this report at its December 1, 1975 meeting that all tide and submerged lands are significant by the nature of their public ownership. Only because of the methodology used for this report are all of these waterways not specifically listed in this inventory.

    It is the intent of the State Lands Commission that the Significant Lands Inventory be periodically updated. This dataset should be considered informational, to assist the Legislature, the Commission, and the public in considering the environmental aspects of a proposed project and the significant values to be protected therein.

  14. NCEI/WDS Global Significant Earthquake Database, 2150 BC to Present

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Oct 18, 2024
    + more versions
    Cite
    DOC/NOAA/NESDIS/NCEI > National Centers for Environmental Information, NESDIS, NOAA, U.S. Department of Commerce (Point of Contact) (2024). NCEI/WDS Global Significant Earthquake Database, 2150 BC to Present [Dataset]. https://catalog.data.gov/dataset/ncei-wds-global-significant-earthquake-database-2150-bc-to-present1
    Explore at:
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    National Centers for Environmental Information (https://www.ncei.noaa.gov/)
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    United States Department of Commerce (http://www.commerce.gov/)
    Description

    The Significant Earthquake Database is a global listing of over 5,700 earthquakes from 2150 BC to the present. A significant earthquake is classified as one that meets at least one of the following criteria: caused deaths, caused moderate damage (approximately $1 million or more), magnitude 7.5 or greater, Modified Mercalli Intensity (MMI) X or greater, or the earthquake generated a tsunami. The database provides information on the date and time of occurrence, latitude and longitude, focal depth, magnitude, maximum MMI intensity, and socio-economic data such as the total number of casualties, injuries, houses destroyed, and houses damaged, and dollar damage estimates. References, political geography, and additional comments are also provided for each earthquake. If the earthquake was associated with a tsunami or volcanic eruption, it is flagged and linked to the related tsunami event or significant volcanic eruption.
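The "at least one criterion" rule for a significant earthquake can be expressed as a simple predicate. The Python sketch below is illustrative only: the `Quake` record and its field names are hypothetical and do not reflect the database's actual column names, MMI X is represented as the integer 10, and the "approximately $1 million" damage criterion is treated as a hard threshold, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Quake:
    """Hypothetical earthquake record; None means the value is unknown."""
    deaths: Optional[int] = None
    damage_million_usd: Optional[float] = None
    magnitude: Optional[float] = None
    mmi: Optional[int] = None  # Modified Mercalli Intensity, 1-12
    tsunami: bool = False


def is_significant(q: Quake) -> bool:
    """True if the record meets at least one of the listed criteria."""
    return any([
        q.deaths is not None and q.deaths > 0,
        q.damage_million_usd is not None and q.damage_million_usd >= 1.0,
        q.magnitude is not None and q.magnitude >= 7.5,
        q.mmi is not None and q.mmi >= 10,  # MMI X or greater
        q.tsunami,
    ])
```

For example, `is_significant(Quake(magnitude=6.1, deaths=3))` is true because of the deaths criterion, even though the magnitude is below 7.5.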

  15. Data from: Fleet Level Anomaly Detection of Aviation Safety Data

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Fleet Level Anomaly Detection of Aviation Safety Data [Dataset]. https://catalog.data.gov/dataset/fleet-level-anomaly-detection-of-aviation-safety-data
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    For the purposes of this paper, the National Airspace System (NAS) encompasses the operations of all aircraft which are subject to air traffic control procedures. The NAS is a highly complex dynamic system that is sensitive to aeronautical decision-making and risk management skills. In order to ensure a healthy system with safe flights, a systematic approach to anomaly detection is very important when evaluating a given set of circumstances and for determination of the best possible course of action. Given the fact that the NAS is a vast and loosely integrated network of systems, it requires improved safety assurance capabilities to maintain an extremely low accident rate under increasingly dense operating conditions. Data mining based tools and techniques are required to support and aid operators' (such as pilots, management, or policy makers) overall decision-making capacity. Within the NAS, the ability to analyze fleetwide aircraft data autonomously is still considered a significantly challenging task. For our purposes a fleet is defined as a group of aircraft sharing generally compatible parameter lists. Here, in this effort, we aim at developing a system level analysis scheme. In this paper we address the capability for detection of fleetwide anomalies as they occur, which itself is an important initiative toward the safety of the real-world flight operations. The flight data recorders archive millions of data points with valuable information on flights every day. The operational parameters consist of both continuous and discrete (binary & categorical) data from several critical subsystems and numerous complex procedures. In this paper, we discuss a system level anomaly detection approach based on the theory of kernel learning to detect potential safety anomalies in a very large database of commercial aircraft. 
We also demonstrate that the proposed approach uncovers some operationally significant events due to environmental, mechanical, and human factors issues in high dimensional, multivariate Flight Operations Quality Assurance (FOQA) data. We present the results of our detection algorithms on real FOQA data from a regional carrier.

  16. Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    + more versions
    Cite
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
    Dataset updated
    Aug 3, 2003
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
    
  17. Dataset for modeling spatial and temporal variation in natural background...

    • catalog.data.gov
    Updated Nov 12, 2020
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Dataset for modeling spatial and temporal variation in natural background specific conductivity [Dataset]. https://catalog.data.gov/dataset/dataset-for-modeling-spatial-and-temporal-variation-in-natural-background-specific-conduct
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This file contains the data set used to develop a random forest model to predict background specific conductivity for stream segments in the contiguous United States. This Excel readable file contains 56 columns of parameters evaluated during development. The data dictionary provides the definition of the abbreviations and the measurement units. Each row is a unique sample described as R** which indicates the NHD Hydrologic Unit (underscore), up to a 7-digit COMID, (underscore) sequential sample month. To develop models that make stream-specific predictions across the contiguous United States, we used the StreamCat data set and process (Hill et al. 2016; https://github.com/USEPA/StreamCat). The StreamCat data set is based on a network of stream segments from NHD+ (McKay et al. 2012). These stream segments drain an average area of 3.1 km² and thus define the spatial grain size of this data set. The data set consists of minimally disturbed sites representing the natural variation in environmental conditions that occur in the contiguous 48 United States. More than 2.4 million SC observations were obtained from STORET (USEPA 2016b), state natural resource agencies, the U.S. Geological Survey (USGS) National Water Information System (NWIS) system (USGS 2016), and data used in Olson and Hawkins (2012) (Table S1). Data include observations made between 1 January 2001 and 31 December 2015, thus coincident with Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (https://modis.gsfc.nasa.gov/data/). Each observation was related to the nearest stream segment in the NHD+. Data were limited to one observation per stream segment per month. SC observations with ambiguous locations and repeat measurements along a stream segment in the same month were discarded. Using estimates of anthropogenic stress derived from the StreamCat database (Hill et al. 2016), segments were selected with minimal amounts of human activity (Stoddard et al. 2006) using criteria developed for each Level II Ecoregion (Omernik and Griffith 2014). Segments were considered as potentially minimally stressed where watersheds had 0 - 0.5% impervious surface, 0 – 5% urban, 0 – 10% agriculture, and population densities from 0.8 – 30 people/km² (Table S3). Watersheds with observations with large residuals in initial models were identified and inspected for evidence of other human activities not represented in StreamCat (e.g., mining, logging, grazing, or oil/gas extraction). Observations were removed from watersheds that were disturbed, had a tidal influence, or had unusual geologic conditions such as hot springs. About 5% of SC observations in each National Rivers and Stream Assessment (NRSA) region were then randomly selected as independent validation data. The remaining observations became the large training data set for model calibration. This dataset is associated with the following publication: Olson, J., and S. Cormier. Modeling spatial and temporal variation in natural background specific conductivity. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 53(8): 4316-4325, (2019).

  18. Turkish Natural Language Inference Dataset

    • opendatabay.com
    Updated Jul 5, 2025
    Datasimple (2025). Turkish Natural Language Inference Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/f4951f96-ebbc-43bf-bed5-36dce9796e6e
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    The NLI-TR dataset, comprising two distinct datasets known as SNLI-TR and MNLI-TR, provides an unparalleled opportunity for research within the natural language processing (NLP) and machine learning communities. Its primary purpose is to facilitate natural language inference research in the Turkish language. The datasets consist of meticulously curated natural language inference data, which has been carefully translated into Turkish from original English sources. This resource enables researchers to develop automated models specifically tailored for making inferences on texts in this vibrant language. Furthermore, it offers valuable insights into cross-lingual generalisation capabilities, allowing investigation into how models trained on data from one language perform when applied to another. It supports tasks ranging from sentence paraphrasing and classification to question answering scenarios, featuring Turkish sentences labelled to indicate whether a premise and hypothesis entail, contradict, or are neutral towards each other.

    Columns

    The dataset records typically include the following columns:

    • premise: This column contains sentences written in Turkish. These sentences have been translated from the English sources used for the original SNLI and MNLI datasets. It serves as the contextual information or the initial statement from which an inference is to be made.
    • hypothesis: This column also contains sentences in Turkish, translated from the English SNLI and MNLI datasets. It represents the conclusion or the statement whose relationship to the premise is being assessed.
    • label: This column assigns a relationship between the premise and hypothesis. Possible values include:
      • 'entailment': The hypothesis logically follows from the premise.
      • 'contradiction': The hypothesis directly contradicts the premise.
      • 'neutral': The hypothesis neither follows from nor contradicts the premise.
    • domain: An optional column assigned by some authors, primarily used when inferences are made between sentences across different semantic domains, such as weather, sports, or finance.
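
    As a rough illustration of the column layout above, the following sketch reads NLI records from CSV text and keeps only rows with one of the three listed labels. The two sample rows are invented for the example, and the assumption that labels are stored as the strings shown above (rather than as integer codes) may not hold for every release of the data.

    ```python
    import csv
    import io
    from collections import Counter

    VALID_LABELS = {"entailment", "contradiction", "neutral"}

    # Two made-up rows standing in for a real file; not taken from the dataset.
    SAMPLE_CSV = """premise,hypothesis,label
    Bir adam gitar çalıyor.,Bir adam müzik yapıyor.,entailment
    Bir adam gitar çalıyor.,Adam uyuyor.,contradiction
    """


    def read_nli_rows(handle):
        """Yield (premise, hypothesis, label) tuples, skipping unknown labels."""
        reader = csv.DictReader(line.strip() for line in handle)
        for row in reader:
            label = row["label"].strip()
            if label in VALID_LABELS:
                yield row["premise"], row["hypothesis"], label


    rows = list(read_nli_rows(io.StringIO(SAMPLE_CSV)))
    label_counts = Counter(label for _, _, label in rows)
    ```

    For a file on disk, the same function can be given an open file handle (opened with `encoding="utf-8"` and `newline=""`) in place of the in-memory sample.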

    Distribution

    The data is typically provided in CSV file format. It includes both training and validation sets to support model development and evaluation. Key files mentioned are SNLI_tr_train.csv for training models, snli_tr_validation for testing or validating model accuracy on unseen data, and multinli_tr_validation_{matched / mismatched}.csv for additional validation on complex scenarios. The multinli_tr_train.csv file contains Turkish sentences with their corresponding labels; with approximately 392,700 records in that file alone, the dataset is considered large-scale.

    Usage

    This dataset is ideal for various applications and use cases in NLP and machine learning:

    • Developing Natural Language Inference (NLI)-based question answering systems for the Turkish language.
    • Training sentiment analysis algorithms to discern sentiment in Turkish text.
    • Building Machine Learning Chatbots that leverage NLI to understand conversational context and respond appropriately in Turkish.
    • Conducting general NLI research in Turkish.
    • Investigating cross-lingual generalisation capabilities of NLP models.
    • Tasks such as sentence paraphrasing, classification, and other NLP techniques applied to Turkish text.

    Coverage

    The dataset's scope is primarily focused on the Turkish language, making it relevant for global use. The data has been translated from English sources, expanding its utility for cross-lingual studies. A specific time range or demographic scope for the data collection is not detailed in the available sources.

    License

    CC0

    Who Can Use It

    The NLI-TR dataset is intended for a broad audience interested in natural language processing and machine learning, including:

    • The natural language processing (NLP) community.
    • The machine learning community.
    • Seasoned and budding researchers looking to delve into NLI tasks.
    • Developers aiming to create automated models for Turkish language inference.
    • Academics and practitioners exploring the cross-lingual generalisation capabilities of models.
    • Anyone working on NLP tasks in Turkish, such as sentence paraphrasing, text classification, or question answering.

    Dataset Name Suggestions

    • NLI-TR (Turkish NLI Research)
    • Turkish Natural Language Inference Dataset
    • SNLI-TR and MNLI-TR Turkish Data
    • Turkish Textual Entailment Data

    Attributes

    Original Data Source: NLI-TR (Turkish NLI Research)

  19. Hand Dataset

    • academictorrents.com
    bittorrent
    Updated Sep 5, 2022
    Arpit Mittal and Andrew Zisserman and Philip H. S. Torr (2022). Hand Dataset [Dataset]. https://academictorrents.com/details/ddb78dcbe9985b51a397697a6d874b9dbc46300f
    Available download formats: bittorrent (250460299)
    Dataset authored and provided by
    Arpit Mittal and Andrew Zisserman and Philip H. S. Torr
    License

    https://academictorrents.com/nolicensespecified

    Description

    We introduce a comprehensive dataset of hand images collected from various public image data set sources as listed in Table 1. A total of 13050 hand instances are annotated. Hand instances whose bounding box is larger than a fixed area (1500 sq. pixels) are considered big enough for detection and are used for evaluation. This gives around 4170 high quality hand instances. While collecting the data, no restriction was imposed on the pose or visibility of people, nor was any constraint imposed on the environment. In each image, all the hands that can be perceived clearly by humans are annotated. Each annotation consists of a bounding rectangle, which does not have to be axis aligned, oriented with respect to the wrist.
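
    Because the bounding rectangles need not be axis aligned, width times height of the enclosing box is not the right area to compare against the 1500 sq. pixel threshold. A minimal sketch of the size filter, assuming each annotation is available as four (x, y) corner points in order around the rectangle (the corner format and the example rectangles are assumptions, not the dataset's actual file layout):

    ```python
    MIN_AREA = 1500  # square pixels, the threshold quoted in the description


    def polygon_area(corners):
        """Area of a simple polygon from its ordered (x, y) corners (shoelace formula)."""
        total = 0.0
        for (x1, y1), (x2, y2) in zip(corners, corners[1:] + corners[:1]):
            total += x1 * y2 - x2 * y1
        return abs(total) / 2.0


    def is_used_for_evaluation(corners):
        """True if the hand instance is larger than the fixed area threshold."""
        return polygon_area(corners) > MIN_AREA


    # Hypothetical annotations: an axis-aligned 50 x 40 box and a small rotated box.
    axis_aligned = [(0, 0), (50, 0), (50, 40), (0, 40)]
    rotated_small = [(0, 10), (10, 0), (30, 20), (20, 30)]

    kept = [box for box in (axis_aligned, rotated_small) if is_used_for_evaluation(box)]
    ```

    The shoelace formula works for any rotation of the rectangle, so no separate handling of the orientation is needed.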

  20. Data from: FISBe: A real-world benchmark dataset for instance segmentation...

    • zenodo.org
    • data.niaid.nih.gov
    bin, json +3
    Updated Apr 2, 2024
    Lisa Mais; Peter Hirsch; Claire Managan; Ramya Kandarpa; Josef Lorenz Rumberger; Annika Reinke; Lena Maier-Hein; Gudrun Ihrke; Dagmar Kainmueller (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. http://doi.org/10.5281/zenodo.10875063
    Available download formats: zip, text/x-python, bin, json, txt
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lisa Mais; Peter Hirsch; Claire Managan; Ramya Kandarpa; Josef Lorenz Rumberger; Annika Reinke; Lena Maier-Hein; Gudrun Ihrke; Dagmar Kainmueller
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Feb 26, 2024
    Description

    General

    For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

    Summary

    • A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
      • 30 completely labeled (segmented) images
      • 71 partly labeled images
      • altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)
    • To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
    • A set of metrics and a novel ranking score for respective meaningful method benchmarking
    • An evaluation of three baseline methods in terms of the above metrics and score

    Abstract

    Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

    Dataset documentation:

    We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

    >> FISBe Datasheet

    Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

    Files

    • fisbe_v1.0_{completely,partly}.zip
      • contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
    • fisbe_v1.0_mips.zip
      • maximum intensity projections of all samples, for convenience.
    • sample_list_per_split.txt
      • a simple list of all samples and the subset they are in, for convenience.
    • view_data.py
      • a simple python script to visualize samples, see below for more information on how to use it.
    • dim_neurons_val_and_test_sets.json
      • a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
    • Readme.md
      • general information

    How to work with the image files

    Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
    For each image, we provide a pixel-wise instance segmentation for all separable neurons.
    Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification).
    The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
    The segmentation mask for each neuron is stored in a separate channel.
    The order of dimensions is CZYX.
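
    To make the CZYX layout concrete, here is a small sketch that uses toy numpy arrays in place of real samples. The array sizes, the three raw channels, and the convention that nonzero voxels mark a neuron are assumptions made for the illustration; real samples are far larger.

    ```python
    import numpy as np

    # Toy stand-ins with the CZYX layout: 3 color channels and 2 neuron masks.
    raw = np.zeros((3, 4, 8, 8), dtype=np.uint8)
    gt_instances = np.zeros((2, 4, 8, 8), dtype=np.uint8)
    gt_instances[0, :, 1:4, 1:4] = 1  # neuron stored in channel 0
    gt_instances[1, :, 3:6, 3:6] = 1  # neuron in channel 1, touching channel 0 at y=3, x=3

    masks = gt_instances > 0                        # one boolean ZYX mask per neuron
    voxels_per_neuron = masks.sum(axis=(1, 2, 3))   # instance sizes in voxels
    overlap = masks.sum(axis=0) > 1                 # voxels claimed by more than one neuron

    n_channels, depth, height, width = raw.shape    # C, Z, Y, X
    ```

    One practical consequence of storing each mask in its own channel is that voxels shared by several neurons can be represented, which a single integer label volume could not do.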

    We recommend working in a virtual environment, e.g., by using conda:

    conda create -y -n flylight-env -c conda-forge python=3.9
    conda activate flylight-env

    How to open zarr files

    1. Install the python zarr package:
      pip install zarr
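    2. Open a zarr file with:

      import zarr

      # path_to_zarr is a placeholder for the path to one sample's zarr file;
      # the array paths match those used in the viewer script below.
      raw = zarr.open(path_to_zarr, mode='r', path="volumes/raw")
      seg = zarr.open(path_to_zarr, mode='r', path="volumes/gt_instances")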

      # optional:
      import numpy as np
      raw_np = np.array(raw)

    Zarr arrays are read lazily on-demand.
    Many functions that expect numpy arrays also work with zarr arrays.
    Optionally, the arrays can also explicitly be converted to numpy arrays.

    How to view zarr image files

    We recommend using napari to view the image data.

    1. Install napari:
      pip install "napari[all]"
    2. Save the following Python script:

      import zarr, sys, napari

      # Load the raw image and the ground-truth instance masks from the zarr file
      raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
      gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

      viewer = napari.Viewer(ndisplay=3)
      # Add one labels layer per ground-truth neuron mask
      for idx, gt in enumerate(gts):
          viewer.add_labels(
              gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
      # Add one image layer per color channel of the raw data
      viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
      viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
      viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
      napari.run()

    3. Execute:
      python view_data.py 

    Metrics

    • S: Average of avF1 and C
    • avF1: Average F1 Score
    • C: Average ground truth coverage
    • clDice_TP: Average true positives clDice
    • FS: Number of false splits
    • FM: Number of false merges
    • tp: Relative number of true positives

    For more information on our selected metrics and formal definitions please see our paper.
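
    As a small worked example of the ranking score S listed above, taking S to be the plain arithmetic mean of avF1 and C as the list states (the function name and the input values are made up for illustration; the formal definitions are in the paper):

    ```python
    def ranking_score(av_f1, coverage):
        """S as listed above: the average of avF1 and C."""
        return 0.5 * (av_f1 + coverage)


    # Illustrative values only, not results from the benchmark.
    example_s = ranking_score(0.25, 0.75)
    ```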

    Baseline

    To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al.
    For detailed information on the methods and the quantitative results please see our paper.

    License

    The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Citation

    If you use FISBe in your research, please use the following BibTeX entry:

    @misc{mais2024fisbe,
     title =    {FISBe: A real-world benchmark dataset for instance
             segmentation of long-range thin filamentous structures},
     author =    {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
             Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
             Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
     year =     2024,
     eprint =    {2404.00130},
     archivePrefix ={arXiv},
     primaryClass = {cs.CV}
    }

    Acknowledgments

    We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions.
    P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
    This work was co-funded by Helmholtz Imaging.

    Changelog

    There have been no changes to the dataset so far.
    All future changes will be listed on the changelog page.

    Contributing

    If you would like to contribute, have encountered any issues, or have any suggestions, please open an issue for the FISBe dataset in the accompanying GitHub repository.

    All contributions are welcome!


Data from: Current and projected research data storage needs of Agricultural Research Service researchers in 2016

Description

The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate response to a data management expert in their unit, to all members of their unit, or to themselves collate responses from their unit before reporting in the survey.
Larger storage ranges cover vastly different amounts of data, so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB of total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per-person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.

Resources in this dataset:

• Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdf. Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF, but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop-down not shown here. Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/
• Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csv. Resource Description: CSV file that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This information is the same data as in the Excel spreadsheet (also provided).
• Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsx. Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
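
The per-person calculation described above (high end of the reported range, divided by 1 for an individual response or by G for a group response) might be expressed as in the sketch below. The function name and the example range are hypothetical and are not taken from the survey data.

```python
def per_person_storage_tb(range_high_tb, group_size=1):
    """High end of the reported storage range divided by the number of people covered.

    group_size is 1 for an individual response, or G for a group response.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return range_high_tb / group_size


# Hypothetical responses: both report a range whose high end is 10 TB.
individual = per_person_storage_tb(10)               # one researcher
group_of_four = per_person_storage_tb(10, group_size=4)  # one response covering 4 people
```

Using the high end of each range gives a conservative (upper) estimate of per-person needs, which fits the stated goal of provisioning adequate storage.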
