100+ datasets found
  1. Data from: A large dataset of detection and submeter-accurate 3-D...

    • datadryad.org
    • explore.openaire.eu
    • +2 more
    zip
    Updated Jul 14, 2021
    Cite
    A large dataset of detection and submeter-accurate 3-D trajectories of juvenile Chinook salmon [Dataset]. https://datadryad.org/stash/dataset/doi:10.5061/dryad.tdz08kpzd
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 14, 2021
    Dataset provided by
    Dryad
    Authors
    Jayson Martinez; Tao Fu; Xinya Li; Hongfei Hou; Jingxian Wang; Brad Eppard; Zhiqun Deng
    Time period covered
    2020
    Description

    Use of JSATS can generate a large volume of data. To manage and visualize the data, an integrated suite of science-based tools known as the Hydropower Biological Evaluation Toolset (HBET) can be used.

  2. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • zenodo.org
    • data.niaid.nih.gov
    csv, zip
    Updated Jan 27, 2022
    Cite
    Hossein Keshavarz; Hossein Keshavarz; Meiyappan Nagappan; Meiyappan Nagappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. http://doi.org/10.5281/zenodo.5907002
    Explore at:
    zip, csv (available download formats)
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hossein Keshavarz; Hossein Keshavarz; Meiyappan Nagappan; Meiyappan Nagappan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
    
    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper was submitted to the MSR 2022 Data Showcase Track.
    
    The datasets are available under directory dataset. There are 4 datasets in this directory. 
    
    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system. 
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
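
    As a rough, hypothetical illustration (not part of the replication package), the training and test CSVs could be loaded with pandas and fed to a simple classifier. The label column buggy is described above; the exact names of the commit-metric columns are not assumed here and are simply taken to be the numeric columns.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Load the recommended balanced training split and the large, unbalanced test split.
    train = pd.read_csv("dataset/apachejit_train.csv").dropna()
    test = pd.read_csv("dataset/apachejit_test_large.csv").dropna()

    # Assumption: the commit metrics are stored as numeric columns; use them all
    # as features, excluding the label itself.
    features = [c for c in train.select_dtypes("number").columns if c != "buggy"]

    model = LogisticRegression(max_iter=1000)
    model.fit(train[features], train["buggy"])

    # Evaluate on the unbalanced test split, mirroring the intended JIT evaluation setup.
    print("test accuracy:", model.score(test[features], test["buggy"]))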
    
    In addition to the dataset, we also provide the scripts we used to build it. These scripts are written in Python 3.8; therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. An installation guide and more details can be found here.

    The scripts consist of Python scripts under the directory src and Python notebooks under the directory notebooks. The Python scripts are mainly responsible for conducting GitHub searches via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. The script collector.py performs the GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).
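
    As a hedged sketch of the kind of mining step gitminer.py performs (this is not the authors' script), PyDriller can traverse a repository and expose the added and deleted lines of each modified file:

    from pydriller import Repository

    # Traverse all commits of a repository (URL or local path) with PyDriller.
    for commit in Repository("https://github.com/apache/commons-lang").traverse_commits():
        for mod in commit.modified_files:
            # diff_parsed exposes the added and deleted lines per file, the raw
            # material for tracing bug-fixing changes back to earlier commits.
            deleted = mod.diff_parsed["deleted"]
            print(commit.hash, mod.filename, len(deleted), "deleted lines")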
    
    References:
    
    1. GumTree
    
    * https://github.com/GumTreeDiff/gumtree
    
    Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Vasteras, Sweden, September 15-19, 2014. 313–324
    
    2. PyDriller
    
    * https://pydriller.readthedocs.io/en/latest/
    
    * Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911
    
    
  3. Data from: A Large-scale Dataset of (Open Source) License Text Variants

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 30, 2022
    Cite
    Stefano Zacchiroli (2022). A Large-scale Dataset of (Open Source) License Text Variants [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6379163
    Explore at:
    Dataset updated
    Mar 30, 2022
    Dataset authored and provided by
    Stefano Zacchiroli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
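
    As a minimal, hypothetical sketch (the file name and column names below are assumptions; the actual schema is documented in the dataset's README), the metadata CSVs could be explored with pandas, for instance to count blobs per detected SPDX license:

    import pandas as pd

    # Hypothetical file and column names; consult the README for the real schema.
    meta = pd.read_csv("license_blobs_metadata.csv")
    by_license = meta.groupby("spdx_license")["length"].agg(["count", "median"])
    print(by_license.sort_values("count", ascending=False).head(10))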

    For more details see the included README file and companion paper:

    Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

    If you use this dataset for research purposes, please acknowledge its use by citing the above paper.

  4. MEI Large Data Set 5

    • kaggle.com
    zip
    Updated Feb 2, 2022
    Cite
    Ian Dickerson (2022). MEI Large Data Set 5 [Dataset]. https://www.kaggle.com/datasets/mathsian/mei-large-data-set-5/discussion
    Explore at:
    zip (133362 bytes; available download formats)
    Dataset updated
    Feb 2, 2022
    Authors
    Ian Dickerson
    Description

    Dataset

    This dataset was created by Ian Dickerson

    Contents

  5. A large database of motor imagery EEG signals and users' demographic,...

    • zenodo.org
    Updated Sep 13, 2023
    + more versions
    Cite
    Dreyer Pauline; Roc Aline; Rimbert Sébastien; Pillette Léa; Lotte Fabien; Dreyer Pauline; Roc Aline; Rimbert Sébastien; Pillette Léa; Lotte Fabien (2023). A large database of motor imagery EEG signals and users' demographic, personality and cognitive profile information for Brain-Computer Interface research [Dataset]. http://doi.org/10.5281/zenodo.7516451
    Explore at:
    Dataset updated
    Sep 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dreyer Pauline; Roc Aline; Rimbert Sébastien; Pillette Léa; Lotte Fabien; Dreyer Pauline; Roc Aline; Rimbert Sébastien; Pillette Léa; Lotte Fabien
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context :
    We share a large database containing electroencephalographic signals from 87 human participants, with more than 20,800 trials in total representing about 70 hours of recording. It was collected during brain-computer interface (BCI) experiments and organized into 3 datasets (A, B, and C) that were all recorded following the same protocol: right and left hand motor imagery (MI) tasks during one single-day session.
    It includes the performance of the associated BCI users, detailed information about the users' demographic, personality and cognitive profile, and the experimental instructions and codes (executed in the open-source platform OpenViBE).
    Such a database could prove useful for various studies, including but not limited to: 1) studying the relationships between BCI users' profiles and their BCI performances, 2) studying how EEG signal properties vary for different users' profiles and MI tasks, 3) using the large number of participants to design cross-user BCI machine learning algorithms, or 4) incorporating users' profile information into the design of EEG signal classification algorithms.

    Sixty participants (Dataset A) performed the first experiment, designed to investigate the impact of experimenters' and users' gender on MI-BCI user training outcomes, i.e., user performance and experience (Pillette et al.). Twenty-one participants (Dataset B) performed the second one, designed to examine the relationship between users' online performance (i.e., classification accuracy) and the characteristics of the chosen user-specific Most Discriminant Frequency Band (MDFB) (Benaroch et al.). The only difference between the two experiments lies in the algorithm used to select the MDFB. Dataset C contains 6 additional participants who completed one of the two experiments described above. Physiological signals were measured using a g.USBAmp (g.tec, Austria), sampled at 512 Hz, and processed online using OpenViBE 2.1.0 (Dataset A) and OpenViBE 2.2.0 (Dataset B). For Dataset C, participants C83 and C85 were recorded with OpenViBE 2.1.0 and the remaining 4 participants with OpenViBE 2.2.0. Experiments were recorded at Inria Bordeaux Sud-Ouest, France.

    Duration : Each participant's folder contains approximately 48 minutes of EEG recording: six 7-minute runs and a 6-minute baseline.


    Documents
    Instructions: checklist read by experimenters during the experiments.
    Questionnaires: the Mental Rotation test used, and the translations of 4 questionnaires, notably the Demographic and Social information, the Pre- and Post-session questionnaires, and the Index of Learning Styles. English and French versions.
    Performance: The online OpenViBE BCI classification performances obtained by each participant are provided for each run, as well as answers to all questionnaires
    Scenarios/scripts : set of OpenViBE scenarios used to perform each of the steps of the MI-BCI protocol, e.g., acquire training data, calibrate the classifier or run the online MI-BCI

    Database : raw signals
    Dataset A : N=60 participants
    Dataset B : N=21 participants
    Dataset C : N=6 participants

  6. Excel, AL Age Group Population Dataset: A Complete Breakdown of Excel Age...

    • neilsberg.com
    csv, json
    Updated Jul 24, 2024
    + more versions
    Cite
    Neilsberg Research (2024). Excel, AL Age Group Population Dataset: A Complete Breakdown of Excel Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2024 Edition [Dataset]. https://www.neilsberg.com/research/datasets/aa8c95e0-4983-11ef-ae5d-3860777c1fe6/
    Explore at:
    csv, json (available download formats)
    Dataset updated
    Jul 24, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Excel
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the age groups. The ages between 0 and 85 were divided into roughly 5-year buckets, and all ages over 85 were aggregated into a single group. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Excel population distribution across 18 age groups. It lists the population in each age group along with each group's share of the total population of Excel. The dataset can be utilized to understand the population distribution of Excel by age. For example, using this dataset, we can identify the largest age group in Excel.

    Key observations

    The largest age group in Excel, AL was 45 to 49 years, with a population of 74 (15.64%), according to the ACS 2018-2022 5-Year Estimates. At the same time, the smallest age group in Excel, AL was 85 years and over, with a population of 2 (0.42%). Source: U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Variables / Data Columns

    • Age Group: This column displays the age group in consideration
    • Population: The population for the specific age group in the Excel is shown in this column.
    • % of Total Population: This column displays the population of each age group as a proportion of the total population of Excel. Please note that the percentages may not sum to exactly 100% due to rounding. A short loading example follows this list.
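
    As a small illustration (the CSV file name below is hypothetical; the column names are the ones listed above), the largest age group can be found with pandas:

    import pandas as pd

    # Hypothetical file name for the downloaded CSV of this dataset.
    df = pd.read_csv("excel-al-population-by-age.csv")

    # Row with the highest population, i.e. the largest age group.
    largest = df.loc[df["Population"].idxmax()]
    print(largest["Age Group"], largest["Population"], largest["% of Total Population"])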

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Excel Population by Age. You can refer to the same here.

  7. Data from: Current and projected research data storage needs of Agricultural...

    • agdatacommons.nal.usda.gov
    • datasets.ai
    • +4 more
    pdf
    Updated Nov 30, 2023
    + more versions
    Cite
    Cynthia Parr (2023). Current and projected research data storage needs of Agricultural Research Service researchers in 2016 [Dataset]. http://doi.org/10.15482/USDA.ADC/1346946
    Explore at:
    pdf (available download formats)
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Ag Data Commons
    Authors
    Cynthia Parr
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover the data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate the response to a data management expert in their unit, to pass it to all members of their unit, or to collate responses from their unit themselves before reporting in the survey.
    Larger storage ranges cover vastly different amounts of data so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.
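
    As a toy illustration of the per-person estimate described above (not code from the report), the high end of a reported storage range is divided by the number of individuals the response covers:

    def per_person_storage_tb(range_high_tb: float, group_size: int = 1) -> float:
        # High end of the reported range divided by 1 for an individual
        # response, or by G, the number of individuals in a group response.
        return range_high_tb / group_size

    # e.g. a group of 5 whose reported range tops out at 100 TB
    print(per_person_storage_tb(100, group_size=5))  # 20.0 TB per person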

    Resources in this dataset:
    • Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdf. Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop down not shown here. Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/
    • Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csv. Resource Description: CSV file includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This information is the same data as in the Excel spreadsheet (also provided).
    • Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsx. Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel

  8. imdb_reviews

    • tensorflow.org
    Updated Sep 20, 2024
    + more versions
    Cite
    (2024). imdb_reviews [Dataset]. https://www.tensorflow.org/datasets/catalog/imdb_reviews
    Explore at:
    Dataset updated
    Sep 20, 2024
    Description

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imdb_reviews', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  9. SAROS - A large, heterogeneous, and sparsely annotated segmentation dataset...

    • cancerimagingarchive.net
    csv, n/a +1
    Updated Mar 7, 2024
    Cite
    The Cancer Imaging Archive (2024). SAROS - A large, heterogeneous, and sparsely annotated segmentation dataset on CT imaging data [Dataset]. http://doi.org/10.25737/SZ96-ZG60
    Explore at:
    csv, n/a, nifti and zip (available download formats)
    Dataset updated
    Mar 7, 2024
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Mar 7, 2024
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description
    Sparsely Annotated Region and Organ Segmentation (SAROS) contributes a large heterogeneous semantic segmentation annotation dataset for existing CT imaging cases on TCIA. The goal of this dataset is to provide high-quality annotations for building body composition analysis tools (References: Koitka 2020 and Haubold 2023). Existing in-house segmentation models were employed to generate annotation candidates on randomly selected cases. All generated annotations were manually reviewed and corrected by medical residents and students on every fifth axial slice, while other slices were set to an ignore label (numeric value 255).

    900 CT series from 882 patients were randomly selected from the following TCIA collections (number of CTs per collection in parentheses): ACRIN-FLT-Breast (32), ACRIN-HNSCC-FDG-PET/CT (48), ACRIN-NSCLC-FDG-PET (129), Anti-PD-1_Lung (12), Anti-PD-1_MELANOMA (2), C4KC-KiTS (175), COVID-19-NY-SBU (1), CPTAC-CM (1), CPTAC-LSCC (3), CPTAC-LUAD (1), CPTAC-PDA (8), CPTAC-UCEC (26), HNSCC (17), Head-Neck Cetuximab (12), LIDC-IDRI (133), Lung-PET-CT-Dx (17), NSCLC Radiogenomics (7), NSCLC-Radiomics (56), NSCLC-Radiomics-Genomics (20), Pancreas-CT (58), QIN-HEADNECK (94), Soft-tissue-Sarcoma (6), TCGA-HNSC (1), TCGA-LIHC (33), TCGA-LUAD (2), TCGA-LUSC (3), TCGA-STAD (2), TCGA-UCEC (1). A script to download and resample the images is provided in our GitHub repository: https://github.com/UMEssen/saros-dataset

    The annotations are provided in NIfTI format and were performed on 5mm slice thickness. The annotation files define foreground labels on the same axial slices and match pixel-perfect. In total, 13 semantic body regions and 6 body part labels were annotated with an index that corresponds to a numeric value in the segmentation file.

    Body Regions

    1. Subcutaneous Tissue
    2. Muscle
    3. Abdominal Cavity
    4. Thoracic Cavity
    5. Bones
    6. Parotid Glands
    7. Pericardium
    8. Breast Implant
    9. Mediastinum
    10. Brain
    11. Spinal Cord
    12. Thyroid Glands
    13. Submandibular Glands

    Body Parts

    1. Torso
    2. Head
    3. Right Leg
    4. Left Leg
    5. Right Arm
    6. Left Arm

    The labels that were modified or require further commentary are listed and explained below:
    • Subcutaneous Adipose Tissue: The cutis was included in this label due to its limited differentiation in 5mm CT.
    • Muscle: All muscular tissue was segmented contiguously and not separated into single muscles. Thus, fasciae and intermuscular fat were included in the label. Inter- and intramuscular fat is subtracted automatically in the process.
    • Abdominal Cavity: This label includes the pelvis. The label does not separate between the positional relationships of the peritoneum.
    • Mediastinum: The International Thymic Malignancy Group (ITMIG) scheme was used for the segmentation guidelines.
    • Head + Neck: The neck is confined by the base of the trapezius muscle.
    • Right + Left Leg: The legs are separated from the torso by the line between the two lowest points of the Rami ossa pubis.
    • Right + Left Arm: The arms are separated from the torso by the diagonal between the most lateral point of the acromion and the tuberculum infraglenoidale.
    For reproducibility on downstream tasks, five cross-validation folds and a test set were pre-defined and are described in the provided spreadsheet. Segmentation was conducted strictly in accordance with anatomical guidelines and only modified if required for the gain of segmentation efficiency.
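
    As a hedged sketch (the annotation file name below is hypothetical), a segmentation volume can be loaded with nibabel and the sparsely annotated slices selected by excluding the ignore label 255 described above:

    import numpy as np
    import nibabel as nib

    # Load one NIfTI annotation volume (hypothetical file name).
    seg = nib.load("case_0001_body-regions.nii.gz").get_fdata().astype(np.uint8)

    # Keep only axial slices that carry real labels; fully ignored slices are all 255.
    annotated = [z for z in range(seg.shape[2]) if np.any(seg[:, :, z] != 255)]
    print(len(annotated), "annotated slices out of", seg.shape[2])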

  10. Large scale API Usage dataset

    • figshare.com
    • data.4tu.nl
    bin
    Updated Jun 6, 2023
    Cite
    Anand Sawant (2023). Large scale API Usage dataset [Dataset]. http://doi.org/10.4121/uuid:cb751e3e-3034-44a1-b0c1-b23128927dd8
    Explore at:
    bin (available download formats)
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Anand Sawant
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is usage data for 50 APIs, collected from over 200,000 GitHub consumers.

  11. Data from: Criminal Recidivism in a Large Cohort of Offenders Released from...

    • catalog.data.gov
    • s.cnmilf.com
    • +1 more
    Updated Mar 12, 2025
    Cite
    National Institute of Justice (2025). Criminal Recidivism in a Large Cohort of Offenders Released from Prison in Florida, 2004-2008 [Dataset]. https://catalog.data.gov/dataset/criminal-recidivism-in-a-large-cohort-of-offenders-released-from-prison-in-florida-2004-20-98557
    Explore at:
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice (http://nij.ojp.gov/)
    Area covered
    Florida
    Description

    The purpose of the study was to quantify the effect of the embrace of DNA technology on offender behavior. In particular, researchers examined whether an offender's knowledge that their DNA profile was entered into a database deterred them from offending in the future and if probative effects resulted from DNA sampling. The researchers coded information using criminal history records and data from Florida's DNA database, both of which are maintained by the Florida Department of Law Enforcement (FDLE), and also utilized court docket information acquired through the Florida Department of Corrections (FDOC) to create a dataset of 156,702 cases involving offenders released from the FDOC in the state of Florida between January 1996 and December 2004. The data contain a total of 50 variables. Major categories of variables include demographic variables regarding the offender, descriptive variables relating to the initial crime committed by the offender, and time-specific variables regarding cases of recidivism.

  12. Big Stone Gap, VA Age Group Population Dataset: A Complete Breakdown of Big...

    • neilsberg.com
    csv, json
    Updated Feb 22, 2025
    + more versions
    Cite
    Big Stone Gap, VA Age Group Population Dataset: A Complete Breakdown of Big Stone Gap Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/big-stone-gap-va-population-by-age/
    Explore at:
    csv, json (available download formats)
    Dataset updated
    Feb 22, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Big Stone Gap, Virginia
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the age groups. The ages between 0 and 85 were divided into roughly 5-year buckets, and all ages over 85 were aggregated into a single group. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Big Stone Gap population distribution across 18 age groups. It lists the population in each age group along with each group's share of the total population of Big Stone Gap. The dataset can be utilized to understand the population distribution of Big Stone Gap by age. For example, using this dataset, we can identify the largest age group in Big Stone Gap.

    Key observations

    The largest age group in Big Stone Gap, VA was 30 to 34 years, with a population of 602 (11.59%), according to the ACS 2019-2023 5-Year Estimates. At the same time, the smallest age group in Big Stone Gap, VA was 85 years and over, with a population of 57 (1.10%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Variables / Data Columns

    • Age Group: This column displays the age group in consideration
    • Population: The population for the specific age group in the Big Stone Gap is shown in this column.
    • % of Total Population: This column displays the population of each age group as a proportion of the total population of Big Stone Gap. Please note that the percentages may not sum to exactly 100% due to rounding.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Big Stone Gap Population by Age. You can refer to the same here.

  13. Motamot: A Dataset for Revealing the Supremacy of Large Language Models over...

    • data.mendeley.com
    Updated May 13, 2024
    Cite
    Fatema Tuj Johora Faria (2024). Motamot: A Dataset for Revealing the Supremacy of Large Language Models over Transformer Models in Bengali Political Sentiment Analysis [Dataset]. http://doi.org/10.17632/hdhnrrwdz2.1
    Explore at:
    Dataset updated
    May 13, 2024
    Authors
    Fatema Tuj Johora Faria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Motamot" dataset contains 7,058 data points labeled with Positive and Negative sentiments, tailored specifically for political sentiment analysis in the Bengali language. It comprises 4,132 instances labeled as Positive and 2,926 instances labeled as Negative.

    Specifics of the core data: Train 5,647 / Test 706 / Validation 705

    • Train: Positive 3,306; Negative 2,341
    • Test: Positive 413; Negative 293
    • Validation: Positive 413; Negative 292

  14. Bridge Data Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Aug 20, 2024
    Cite
    Frederik Ebert; Yanlai Yang; Karl Schmeckpeper; Bernadette Bucher; Georgios Georgakis; Kostas Daniilidis; Chelsea Finn; Sergey Levine (2024). Bridge Data Dataset [Dataset]. https://paperswithcode.com/dataset/bridge-data
    Explore at:
    Dataset updated
    Aug 20, 2024
    Authors
    Frederik Ebert; Yanlai Yang; Karl Schmeckpeper; Bernadette Bucher; Georgios Georgakis; Kostas Daniilidis; Chelsea Finn; Sergey Levine
    Description

    Bridge Data is a large multi-domain and multi-task dataset collected using a low-cost yet versatile 6-DoF WidowX250 robot arm. It contains 7,200 demonstrations of a robot performing 71 kitchen tasks across 10 environments with varying lighting, robot positions, and backgrounds. It can be used to boost generalization of robotic skills and to empirically study how it can improve the learning of new tasks in new environments.

  15. Data from: MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile...

    • produccioncientifica.ucm.es
    • produccioncientifica.ugr.es
    • +1 more
    Updated 2024
    Cite
    Banos, Oresti; Damas, Miguel; Goicoechea, Carmen; Perakakis, Pandelis; Pomares, Hector; Rodriguez-Leon, Ciro; Sanabria, Daniel; Villalonga, Claudia; Banos, Oresti; Damas, Miguel; Goicoechea, Carmen; Perakakis, Pandelis; Pomares, Hector; Rodriguez-Leon, Ciro; Sanabria, Daniel; Villalonga, Claudia (2024). MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile Dataset for Investigating Individual and Collective Well-Being [Dataset]. https://produccioncientifica.ucm.es/documentos/668fc499b9e7c03b01be2372
    Explore at:
    Dataset updated
    2024
    Authors
    Banos, Oresti; Damas, Miguel; Goicoechea, Carmen; Perakakis, Pandelis; Pomares, Hector; Rodriguez-Leon, Ciro; Sanabria, Daniel; Villalonga, Claudia; Banos, Oresti; Damas, Miguel; Goicoechea, Carmen; Perakakis, Pandelis; Pomares, Hector; Rodriguez-Leon, Ciro; Sanabria, Daniel; Villalonga, Claudia
    Description

    This study engaged 409 participants over a period spanning from July 10 to August 8, 2023, ensuring representation across various demographic factors: 221 females, 186 males, 2 non-binary, year of birth between 1951 and 2005, with varied annual incomes and from 15 Spanish regions. The MobileWell400+ dataset, openly accessible, encompasses a wide array of data collected via the participants' mobile phone, including demographic, emotional, social, behavioral, and well-being data. Methodologically, the project presents a promising avenue for uncovering new social, behavioral, and emotional indicators, supplementing existing literature. Notably, artificial intelligence is considered to be instrumental in analysing these data, discerning patterns, and forecasting trends, thereby advancing our comprehension of individual and population well-being. Ethical standards were upheld, with participants providing informed consent.

    The following is a non-exhaustive list of collected data:

    Data continuously collected through the participants' smartphone sensors: physical activity (resting, walking, driving, cycling, etc.), name of detected WiFi networks, connectivity type (WiFi, mobile, none), ambient light, ambient noise, and status of the device screen (on, off, locked, unlocked).

    Data corresponding to an initial survey prompted via the smartphone, with information related to demographic data, effects and COVID vaccination, average hours of physical activity, and answers to a series of questions to measure mental health, many of them taken from internationally recognised psychological and well-being scales (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception.

    Data corresponding to daily surveys prompted via the smartphone, where variables related to mood (valence, activation, energy and emotional events) and social interaction (quantity and quality) are measured.

    Data corresponding to weekly surveys prompted via the smartphone, where information on overall health, hours of physical activity per week, loneliness, and questions related to well-being is collected.

    Data corresponding to a final survey prompted via the smartphone, consisting of questions similar to the ones asked in the initial survey, namely psychological and well-being items (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception questions.

    For a more detailed description of the study please refer to MobileWell400+StudyDescription.pdf.

    For a more detailed description of the collected data, variables and data files please refer to MobileWell400+FilesDescription.pdf.

  16. UCDP External Support Dataset

    • datacatalogue.cessda.eu
    • snd.se
    Updated Aug 7, 2024
    + more versions
    Cite
    Högbladh, Stina; Pettersson, Therése; Themnér, Lotta (2024). UCDP External Support Dataset [Dataset]. https://datacatalogue.cessda.eu/detail?q=46183e962fc702dccd3b04f021bf6e0d1c5f02eb551cbf4e7287b295343cc12e&lang=en
    Explore at:
    Dataset updated
    Aug 7, 2024
    Dataset provided by
    Department of Peace and Conflict Research, Uppsala University
    Authors
    Högbladh, Stina; Pettersson, Therése; Themnér, Lotta
    Variables measured
    Event/Process/Activity
    Description

    The UCDP, the Uppsala Conflict Data Program, contains a large amount of data on organised violence, armed violence, and peacemaking. There is information from 1946 up to today, and the datasets are updated continuously. The data can be downloaded for free and are available in several different versions.

    The UCDP External Support Data contains information on external support in intrastate conflicts, 1975-2010. It provides information on the kind of support, the external actor, and the specific year. The data is divided into two separate datasets which are analogous, i.e. they contain identical data structured in a different manner to simplify various types of research, such as different types of statistical analyses:

    1. One dataset provides data where the unit of analysis is a warring party-year, providing information on the existence, type, and provider of external support for all warring parties (actors) coded as active in UCDP data, on an annual basis. The dataset contains information for the time-period 1975–2010. It involves 29 variables and 3606 individuals/objects.
    2. One dataset provides data where the unit of analysis is the warring party-supporter-year, i.e. each row in the dataset contains information on the type of support that a warring party receives from a specific external party in a given year, using dummy variables for each category of support. The dataset contains information for the time-period 1975–2010. It involves 30 variables and 6519 individuals/objects.

  17. Data from: CLIVAR LE project

    • rda.ucar.edu
    • data.ucar.edu
    • +1 more
    Cite
    CLIVAR LE project [Dataset]. https://rda.ucar.edu/lookfordata/datasets/?nb=y&b=topic&v=Atmosphere
    Explore at:
    Description

    The CLIVAR Large Ensemble repository was built at NCAR and supported by the US CLIVAR WG on Large Ensembles. It features a set of CMORized variables from the following CMIP5 ... class Large Ensembles: CANESM2, CESM, CSIRO MK36, EC Earth, GFDL CM3, GFDL ESM2M, MPI, and OLENS McKinnon.

  18. S2Looking Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 25, 2024
    Cite
    Li Shen; Yao Lu; Hao Chen; Hao Wei; Donghai Xie; Jiabao Yue; Rui Chen; Shouye Lv; Bitao Jiang (2024). S2Looking Dataset [Dataset]. https://paperswithcode.com/dataset/s2looking
    Explore at:
    Dataset updated
    Jun 25, 2024
    Authors
    Li Shen; Yao Lu; Hao Chen; Hao Wei; Donghai Xie; Jiabao Yue; Rui Chen; Shouye Lv; Bitao Jiang
    Description

    S2Looking is a building change detection dataset that contains large-scale side-looking satellite images captured at varying off-nadir angles. The S2Looking dataset consists of 5,000 registered bitemporal image pairs (size of 1024*1024, 0.5 ~ 0.8 m/pixel) of rural areas throughout the world and more than 65,920 annotated change instances. We provide two label maps to separately indicate the newly built and demolished building regions for each sample in the dataset. We establish a benchmark task based on this dataset, i.e., identifying the pixel-level building changes in the bi-temporal images.

  19. Assessment and Improvement of Statistical Tools for Comparative Proteomics...

    • figshare.com
    • acs.figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Veit Schwämmle; Ileana Rodríguez León; Ole Nørregaard Jensen (2023). Assessment and Improvement of Statistical Tools for Comparative Proteomics Analysis of Sparse Data Sets with Few Experimental Replicates [Dataset]. http://doi.org/10.1021/pr400045u.s002
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Veit Schwämmle; Ileana Rodríguez León; Ole Nørregaard Jensen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Large-scale quantitative analyses of biological systems are often performed with few replicate experiments, leading to multiple nonidentical data sets due to missing values. For example, mass spectrometry driven proteomics experiments are frequently performed with few biological or technical replicates due to sample-scarcity or due to duty-cycle or sensitivity constraints, or limited capacity of the available instrumentation, leading to incomplete results where detection of significant feature changes becomes a challenge. This problem is further exacerbated for the detection of significant changes on the peptide level, for example, in phospho-proteomics experiments. In order to assess the extent of this problem and the implications for large-scale proteome analysis, we investigated and optimized the performance of three statistical approaches by using simulated and experimental data sets with varying numbers of missing values. We applied three tools, including standard t test, moderated t test, also known as limma, and rank products for the detection of significantly changing features in simulated and experimental proteomics data sets with missing values. The rank product method was improved to work with data sets containing missing values. Extensive analysis of simulated and experimental data sets revealed that the performance of the statistical analysis tools depended on simple properties of the data sets. High-confidence results were obtained by using the limma and rank products methods for analyses of triplicate data sets that exhibited more than 1000 features and more than 50% missing values. The maximum number of differentially represented features was identified by using limma and rank products methods in a complementary manner. We therefore recommend combined usage of these methods as a novel and optimal way to detect significantly changing features in these data sets. This approach is suitable for large quantitative data sets from stable isotope labeling and mass spectrometry experiments and should be applicable to large data sets of any type. An R script that implements the improved rank products algorithm and the combined analysis is available.
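
    The improved rank products implementation and limma referenced above are R tools; purely as a rough Python analogue of the baseline standard t test on sparse data, missing values can simply be omitted per feature:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group_a = rng.normal(0.0, 1.0, size=(1000, 3))  # 1000 features, 3 replicates
    group_b = rng.normal(0.3, 1.0, size=(1000, 3))

    # Introduce roughly 50% missing values, as in the sparse data sets discussed above.
    for g in (group_a, group_b):
        g[rng.random(g.shape) < 0.5] = np.nan

    # Feature-wise two-sample t test, ignoring NaNs within each feature.
    t, p = stats.ttest_ind(group_a, group_b, axis=1, nan_policy="omit")
    print("features with p < 0.05:", int(np.sum(p < 0.05)))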

  20. A sample medical dataset.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    + more versions
    Cite
    Farough Ashkouti; Keyhan Khamforoosh (2023). A sample medical dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0285212.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Farough Ashkouti; Keyhan Khamforoosh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recently, big data and its applications have seen sharp growth in various fields such as IoT, bioinformatics, eCommerce, and social media. The huge volume of data poses enormous challenges to the architecture, infrastructure, and computing capacity of IT systems. Therefore, the compelling need of the scientific and industrial community is for large-scale and robust computing systems. Since one of the characteristics of big data is value, data should be published so that analysts can extract useful patterns from them. However, data publishing may lead to the disclosure of individuals' private information. Among the modern parallel computing platforms, Apache Spark is a fast, in-memory computing framework for large-scale data processing that provides high scalability by introducing the resilient distributed dataset (RDD). In terms of performance, due to in-memory computation, it is 100 times faster than Hadoop. Therefore, Apache Spark is one of the essential frameworks to implement distributed methods for privacy-preserving big data publishing (PPBDP). This paper uses the RDD programming of Apache Spark to propose an efficient parallel implementation of a new computing model for big data anonymization. This computing model has three phases of in-memory computation to address the runtime, scalability, and performance of large-scale data anonymization. The model supports partition-based data clustering algorithms to preserve the λ-diversity privacy model by using transformations and actions on RDDs. The authors have therefore investigated a Spark-based implementation for preserving the λ-diversity privacy model with two designed distance functions, City block and Pearson. The results of the paper provide a comprehensive guideline allowing researchers to apply Apache Spark in their own research.
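
    As a highly simplified sketch in the spirit of the RDD-based, partition-wise processing described above (this is not the paper's algorithm), PySpark's mapPartitions can apply a clustering-style step independently inside each partition:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "anonymization-sketch")

    # Toy records: (age, diagnosis) pairs spread over two partitions.
    records = [("29", "flu"), ("31", "flu"), ("45", "cancer"), ("47", "diabetes")]
    rdd = sc.parallelize(records, numSlices=2)

    def cluster_partition(rows):
        # Placeholder for a partition-based clustering phase: bucket records
        # by decade of age before any generalization or suppression step.
        buckets = {}
        for age, diagnosis in rows:
            buckets.setdefault(int(age) // 10, []).append(diagnosis)
        yield from buckets.items()

    print(rdd.mapPartitions(cluster_partition).collect())
    sc.stop()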
