100+ datasets found

Data from: A large dataset of detection and submeter-accurate 3-D...
datadryad.org
explore.openaire.eu
zip
Updated Jul 14, 2021
A large dataset of detection and submeter-accurate 3-D trajectories of juvenile Chinook salmon [Dataset]. https://datadryad.org/stash/dataset/doi:10.5061/dryad.tdz08kpzd
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.tdz08kpzd
Dataset updated
Jul 14, 2021
Dataset provided by
Dryad
Authors
Jayson Martinez; Tao Fu; Xinya Li; Hongfei Hou; Jingxian Wang; Brad Eppard; Zhiqun Deng
Time period covered
2020
Description
Use of JSATS can generate a large volume of data. To manage and visualize the data, an integrated suite of science-based tools known as the Hydropower Biological Evaluation Toolset (HBET) can be used.

Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

zenodo.org
data.niaid.nih.gov

csv, zip

Updated Jan 27, 2022


Hossein Keshavarz; Meiyappan Nagappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. http://doi.org/10.5281/zenodo.5907002

Available download formats: zip, csv

Unique identifier

https://doi.org/10.5281/zenodo.5907002

Dataset updated

Jan 27, 2022

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Hossein Keshavarz; Meiyappan Nagappan

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

The datasets are available under directory dataset. There are 4 datasets in this directory.

1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
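
To illustrate why a balanced training file is offered alongside unbalanced test files, the sketch below measures the share of bug-inducing commits in a CSV that has a `buggy` column, as described above. The inline sample rows, the `commit_id` column name, and the `True`/`False` encoding of `buggy` are assumptions made for the example; the actual files may be laid out differently.

```python
import csv
import io


def buggy_ratio(csv_file):
    """Return the fraction of rows whose `buggy` column is truthy."""
    reader = csv.DictReader(csv_file)
    total = 0
    buggy = 0
    for row in reader:
        total += 1
        # Assumed encoding: "True"/"False" (any case) or "1"/"0".
        if row["buggy"].strip().lower() in ("true", "1"):
            buggy += 1
    # An empty file has no commits, so report a ratio of zero.
    return buggy / total if total else 0.0


# Hypothetical rows standing in for a file such as apachejit_train.csv.
SAMPLE = """commit_id,buggy
a1,True
b2,False
c3,False
d4,True
"""

ratio = buggy_ratio(io.StringIO(SAMPLE))
print(f"buggy share: {ratio:.2f}")
```

With the real archive, the same function could be given an open file handle (for example one opened with `newline=""` on a file inside the dataset directory); a ratio near 0.5 would be expected for the balanced training set and a much lower one for the test sets.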

In addition to the dataset, we also provide the scripts we used to build it. These scripts are written in Python 3.8, so Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. An installation guide and more details can be found here.

The scripts comprise Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub searches via the GitHub search API and collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which uses the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. The script collector.py performs the GitHub search. Tracing changed lines and git annotate are done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

References:

1. GumTree

* https://github.com/GumTreeDiff/gumtree

* Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014. 313–324.

2. PyDriller

* https://pydriller.readthedocs.io/en/latest/

* Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911.

Data from: A Large-scale Dataset of (Open Source) License Text Variants
data.niaid.nih.gov
zenodo.org
Updated Mar 30, 2022
Stefano Zacchiroli (2022). A Large-scale Dataset of (Open Source) License Text Variants [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6379163
Dataset updated
Mar 30, 2022
Dataset authored and provided by
Stefano Zacchiroli
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
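
Because the metadata CSV files reference blobs via cryptographic checksums, linking a license text to its metadata comes down to hashing the blob and using the digest as a lookup key. The sketch below shows that idea under stated assumptions: SHA-1 is assumed as the checksum, and the column names `sha1` and `mime_type` as well as the sample bytes are invented for illustration. The README shipped with the dataset defines the real hash and schema.

```python
import csv
import hashlib
import io


def blob_checksum(data: bytes) -> str:
    """Hex digest used as the join key (SHA-1 is an assumption here)."""
    return hashlib.sha1(data).hexdigest()


def index_metadata(csv_file, key="sha1"):
    """Build a dict from checksum to metadata row (hypothetical schema)."""
    return {row[key]: row for row in csv.DictReader(csv_file)}


# Hypothetical blob and a one-row metadata table that references it.
license_text = b"Example license text\n"
digest = blob_checksum(license_text)
metadata_csv = io.StringIO(f"sha1,mime_type\n{digest},text/plain\n")

index = index_metadata(metadata_csv)
record = index.get(blob_checksum(license_text))
print(record["mime_type"])
```

Keying on the checksum rather than a file name is what lets deduplicated blobs be shared across many origins while the metadata stays in portable CSV files.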

For more details see the included README file and companion paper:

Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
MEI Large Data Set 5
kaggle.com
zip
Updated Feb 2, 2022
Ian Dickerson (2022). MEI Large Data Set 5 [Dataset]. https://www.kaggle.com/datasets/mathsian/mei-large-data-set-5/discussion
Available download formats: zip (133362 bytes)
Dataset updated
Feb 2, 2022
Authors
Ian Dickerson
Description
Dataset

This dataset was created by Ian Dickerson

A large database of motor imagery EEG signals and users' demographic,...
zenodo.org
Updated Sep 13, 2023
Dreyer Pauline; Roc Aline; Rimbert Sébastien; Pillette Léa; Lotte Fabien (2023). A large database of motor imagery EEG signals and users' demographic, personality and cognitive profile information for Brain-Computer Interface research [Dataset]. http://doi.org/10.5281/zenodo.7516451
Unique identifier
https://doi.org/10.5281/zenodo.7516451
Dataset updated
Sep 13, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Dreyer Pauline; Roc Aline; Rimbert Sébastien; Pillette Léa; Lotte Fabien
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context:
We share a large database containing electroencephalographic signals from 87 human participants, with more than 20,800 trials in total representing about 70 hours of recording. It was collected during brain-computer interface (BCI) experiments and organized into 3 datasets (A, B, and C) that were all recorded following the same protocol: right and left hand motor imagery (MI) tasks during one single day session.
It includes the performance of the associated BCI users, detailed information about the users' demographic, personality and cognitive profiles, and the experimental instructions and codes (executed in the open-source platform OpenViBE).
Such a database could prove useful for various studies, including but not limited to: 1) studying the relationships between BCI users' profiles and their BCI performances, 2) studying how EEG signal properties vary for different users' profiles and MI tasks, 3) using the large number of participants to design cross-user BCI machine learning algorithms or 4) incorporating users' profile information into the design of EEG signal classification algorithms.

Sixty participants (Dataset A) performed the first experiment, designed to investigate the impact of experimenters' and users' gender on MI-BCI user training outcomes, i.e., users' performance and experience (Pillette et al.). Twenty-one participants (Dataset B) performed the second one, designed to examine the relationship between users' online performance (i.e., classification accuracy) and the characteristics of the chosen user-specific Most Discriminant Frequency Band (MDFB) (Benaroch et al.). The only difference between the two experiments lies in the algorithm used to select the MDFB. Dataset C contains 6 additional participants who completed one of the two experiments described above. Physiological signals were measured using a g.USBAmp (g.tec, Austria), sampled at 512 Hz, and processed online using OpenViBE 2.1.0 (Dataset A) and OpenViBE 2.2.0 (Dataset B). For Dataset C, data from participants C83 and C85 were collected with OpenViBE 2.1.0 and data from the remaining 4 participants with OpenViBE 2.2.0. Experiments were recorded at Inria Bordeaux sud-ouest, France.

Duration: Each participant's folder contains approximately 48 minutes of EEG recording, made up of six 7-minute runs and a 6-minute baseline.
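
The figures quoted in this description can be cross-checked with a little arithmetic. The sketch below uses only numbers stated above (six 7-minute runs, a 6-minute baseline, 512 Hz sampling, 87 participants); the variable names are illustrative, and real recordings are only approximately this long.

```python
RUNS = 6
RUN_MINUTES = 7
BASELINE_MINUTES = 6
SAMPLING_RATE_HZ = 512
PARTICIPANTS = 87

# Nominal recording time per participant, in minutes.
minutes_per_participant = RUNS * RUN_MINUTES + BASELINE_MINUTES

# Nominal number of samples per EEG channel for one participant.
samples_per_channel = minutes_per_participant * 60 * SAMPLING_RATE_HZ

# Nominal total recording time across the whole database, in hours.
total_hours = PARTICIPANTS * minutes_per_participant / 60

print(minutes_per_participant)  # 48
print(samples_per_channel)      # 1474560
print(round(total_hours, 1))    # 69.6
```

The nominal total of 69.6 hours is consistent with the "about 70 hours" quoted for the full database.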

Documents
Instructions: checklist read by experimenters during the experiments.
Questionnaires: the Mental Rotation test used and the translation of 4 questionnaires, notably the Demographic and Social information, the Pre- and Post-session questionnaires, and the Index of Learning Style. English and French versions are provided.
Performance: The online OpenViBE BCI classification performances obtained by each participant are provided for each run, as well as answers to all questionnaires
Scenarios/scripts : set of OpenViBE scenarios used to perform each of the steps of the MI-BCI protocol, e.g., acquire training data, calibrate the classifier or run the online MI-BCI

Database : raw signals
Dataset A : N=60 participants
Dataset B : N=21 participants
Dataset C : N=6 participants
Excel, AL Age Group Population Dataset: A Complete Breakdown of Excel Age...
neilsberg.com
csv, json
Updated Jul 24, 2024
Neilsberg Research (2024). Excel, AL Age Group Population Dataset: A Complete Breakdown of Excel Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2024 Edition [Dataset]. https://www.neilsberg.com/research/datasets/aa8c95e0-4983-11ef-ae5d-3860777c1fe6/
Available download formats: csv, json
Dataset updated
Jul 24, 2024
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Excel
Variables measured
Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
Measurement technique
The data presented in this dataset are derived from the latest U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we first analyzed and categorized the data for each age group. We divided ages between 0 and 85 into buckets of roughly 5 years each and aggregated all ages of 85 and over into a single group. For further information regarding these estimates, please reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the Excel population distribution across 18 age groups. It lists the population in each age group along with its percentage of the total population of Excel. The dataset can be used to understand the population distribution of Excel by age. For example, using this dataset, we can identify the largest age group in Excel.

Key observations

The largest age group in Excel, AL was 45 to 49 years, with a population of 74 (15.64%), according to the ACS 2018-2022 5-Year Estimates. The smallest age group in Excel, AL was 85 years and over, with a population of 2 (0.42%). Source: U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates

Age groups:

Under 5 years

5 to 9 years

10 to 14 years

15 to 19 years

20 to 24 years

25 to 29 years

30 to 34 years

35 to 39 years

40 to 44 years

45 to 49 years

50 to 54 years

55 to 59 years

60 to 64 years

65 to 69 years

70 to 74 years

75 to 79 years

80 to 84 years

85 years and over
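
The 18 groups listed above follow a simple rule: 5-year buckets up to age 84, plus one open-ended group from 85. The function below is an illustrative sketch of that rule, not the publisher's actual processing; the name `age_group` and the decision to reject negative ages are assumptions.

```python
def age_group(age: int) -> str:
    """Map an age in whole years to one of the 18 group labels."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 5:
        return "Under 5 years"
    if age >= 85:
        return "85 years and over"
    # Round down to the start of the 5-year bucket, e.g. 47 -> 45.
    low = (age // 5) * 5
    return f"{low} to {low + 4} years"


# One representative age per bucket: 0, 5, 10, ..., 85.
ALL_GROUPS = [age_group(a) for a in range(0, 90, 5)]

print(len(ALL_GROUPS))  # 18
print(age_group(47))    # 45 to 49 years
```

Enumerating one age per bucket is a quick way to confirm that the rule yields exactly the 18 labels in the list.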

Variables / Data Columns

Age Group: This column displays the age group in consideration

Population: The population for the specific age group in Excel is shown in this column.

% of Total Population: This column displays the population of each age group as a percentage of Excel's total population. Please note that the sum of all percentages may not equal 100% due to rounding of values.

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for Excel Population by Age. You can refer to the same here
Data from: Current and projected research data storage needs of Agricultural...
agdatacommons.nal.usda.gov
datasets.ai
pdf
Updated Nov 30, 2023
Cynthia Parr (2023). Current and projected research data storage needs of Agricultural Research Service researchers in 2016 [Dataset]. http://doi.org/10.15482/USDA.ADC/1346946
Available download formats: pdf
Unique identifier
https://doi.org/10.15482/USDA.ADC/1346946
Dataset updated
Nov 30, 2023
Dataset provided by
Ag Data Commons
Authors
Cynthia Parr
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The USDA Agricultural Research Service (ARS) recently established SCINet , which consists of a shared high performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate response to a data management expert in their unit, to all members of their unit, or to themselves collate responses from their unit before reporting in the survey.
Larger storage ranges cover vastly different amounts of data so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.
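
The per-person calculation described above (the high end of the reported range, divided by 1 for an individual response or by G for a group response) can be written as a one-line function. This is a sketch of the stated rule only; the function name and the example ranges and group size are hypothetical, and the estimation step used for Big Data users is not modelled.

```python
def per_person_storage_tb(range_high_tb: float, group_size: int = 1) -> float:
    """High end of the reported range divided by the number of people (G)."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return range_high_tb / group_size


# Hypothetical individual response reporting a range whose high end is 10 TB.
individual = per_person_storage_tb(10)

# Hypothetical group response: high end of 100 TB shared by G = 4 people.
group = per_person_storage_tb(100, group_size=4)

print(individual)  # 10.0
print(group)       # 25.0
```

Using the high end of each range gives a conservative, upper-bound estimate, which is why the wide ranges at the top of the scale were followed up for actual values.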

Resources in this dataset:

Resource Title: Appendix A: ARS data storage survey questions.
File Name: Appendix A.pdf
Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop down not shown here.
Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/

Resource Title: CSV of Responses from ARS Researcher Data Storage Survey.
File Name: Machine-readable survey response data.csv
Resource Description: CSV file includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).

Resource Title: Responses from ARS Researcher Data Storage Survey.
File Name: Data Storage Survey Data for public release.xlsx
Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.
Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
imdb_reviews
tensorflow.org
Updated Sep 20, 2024
(2024). imdb_reviews [Dataset]. https://www.tensorflow.org/datasets/catalog/imdb_reviews
Dataset updated
Sep 20, 2024
Description
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

To use this dataset:

import tensorflow_datasets as tfds

# Load the training split and print the first four examples.
ds = tfds.load('imdb_reviews', split='train')
for ex in ds.take(4):
    print(ex)

See the guide for more information on tensorflow_datasets.
SAROS - A large, heterogeneous, and sparsely annotated segmentation dataset...
cancerimagingarchive.net
csv, n/a +1
Updated Mar 7, 2024
The Cancer Imaging Archive (2024). SAROS - A large, heterogeneous, and sparsely annotated segmentation dataset on CT imaging data [Dataset]. http://doi.org/10.25737/SZ96-ZG60
Available download formats: csv, n/a, nifti and zip
Unique identifier
https://doi.org/10.25737/SZ96-ZG60
Dataset updated
Mar 7, 2024
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
Mar 7, 2024
Dataset funded by
National Cancer Institute (http://www.cancer.gov/)
Description
Sparsely Annotated Region and Organ Segmentation (SAROS) contributes a large heterogeneous semantic segmentation annotation dataset for existing CT imaging cases on TCIA. The goal of this dataset is to provide high-quality annotations for building body composition analysis tools (References: Koitka 2020 and Haubold 2023). Existing in-house segmentation models were employed to generate annotation candidates on randomly selected cases. All generated annotations were manually reviewed and corrected by medical residents and students on every fifth axial slice while other slices were set to an ignore label (numeric value 255). 900 CT series from 882 patients were randomly selected from the following TCIA collections (number of CTs per collection in parentheses): ACRIN-FLT-Breast (32), ACRIN-HNSCC-FDG-PET/CT (48), ACRIN-NSCLC-FDG-PET (129), Anti-PD-1_Lung (12), Anti-PD-1_MELANOMA (2), C4KC-KiTS (175), COVID-19-NY-SBU (1), CPTAC-CM (1), CPTAC-LSCC (3), CPTAC-LUAD (1), CPTAC-PDA (8), CPTAC-UCEC (26), HNSCC (17), Head-Neck Cetuximab (12), LIDC-IDRI (133), Lung-PET-CT-Dx (17), NSCLC Radiogenomics (7), NSCLC-Radiomics (56), NSCLC-Radiomics-Genomics (20), Pancreas-CT (58), QIN-HEADNECK (94), Soft-tissue-Sarcoma (6), TCGA-HNSC (1), TCGA-LIHC (33), TCGA-LUAD (2), TCGA-LUSC (3), TCGA-STAD (2), TCGA-UCEC (1). A script to download and resample the images is provided in our GitHub repository (https://github.com/UMEssen/saros-dataset). The annotations are provided in NIfTI format and were performed on 5 mm slice thickness. The annotation files define foreground labels on the same axial slices and match pixel-perfect. In total, 13 semantic body regions and 6 body part labels were annotated with an index that corresponds to a numeric value in the segmentation file.
Body Regions

Subcutaneous Tissue

Muscle

Abdominal Cavity

Thoracic Cavity

Bones

Parotid Glands

Pericardium

Breast Implant

Mediastinum

Brain

Spinal Cord

Thyroid Glands

Submandibular Glands

Body Parts

Torso

Head

Right Leg

Left Leg

Right Arm

Left Arm

The labels which were modified or require further commentary are listed and explained below:

Subcutaneous Adipose Tissue: The cutis was included into this label due to its limited differentiation in 5mm-CT.

Muscle: All muscular tissue was segmented contiguously and not separated into single muscles. Thus, fascias and intermuscular fat were included into the label. Inter- and intramuscular fat is subtracted automatically in the process.

Abdominal Cavity: This label includes the pelvis. The label does not separate between the positional relationships of the peritoneum.

Mediastinum: The International Thymic Malignancy Group (ITMIG) scheme was used for the segmentation guidelines.

Head + Neck: The neck is confined by the base of the trapezius muscle.

Right + Left Leg: The legs are separated from the torso by the line between the two lowest points of the Rami ossa pubis.

Right + Left Arm: The arms are separated from the torso by the diagonal between the most lateral point of the acromion and the tuberculum infraglenoidale.

For reproducibility on downstream tasks, five cross-validation folds and a test set were pre-defined and are described in the provided spreadsheet. Segmentation was conducted strictly in accordance with anatomical guidelines and only modified if required for the gain of segmentation efficiency.
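
Since only every fifth axial slice is annotated and the rest carry the ignore label 255, any evaluation on this data has to skip ignored voxels. The sketch below computes a Dice score on flattened label sequences while excluding them. It is a minimal illustration, not the authors' evaluation code: the function name, the toy arrays, and the label indices 1 and 2 are hypothetical.

```python
IGNORE_LABEL = 255  # value used for non-annotated slices, per the description


def dice_ignoring(pred, target, label, ignore=IGNORE_LABEL):
    """Dice score for one label, skipping voxels where target == ignore."""
    intersection = 0
    pred_count = 0
    target_count = 0
    for p, t in zip(pred, target):
        if t == ignore:
            continue  # voxel lies on a slice that was never annotated
        if p == label:
            pred_count += 1
        if t == label:
            target_count += 1
        if p == label and t == label:
            intersection += 1
    denom = pred_count + target_count
    # If the label is absent from both, treat the match as perfect.
    return 2 * intersection / denom if denom else 1.0


# Toy flattened volumes; 1 and 2 are hypothetical label indices.
target = [1, 1, 2, 255, 255, 2]
pred = [1, 2, 2, 1, 2, 2]

print(dice_ignoring(pred, target, label=1))
print(dice_ignoring(pred, target, label=2))
```

Without the ignore check, predictions on unannotated slices would count as false positives and depress the score for reasons unrelated to model quality.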
Large scale API Usage dataset
figshare.com
data.4tu.nl
bin
Updated Jun 6, 2023
Anand Sawant (2023). Large scale API Usage dataset [Dataset]. http://doi.org/10.4121/uuid:cb751e3e-3034-44a1-b0c1-b23128927dd8
Available download formats: bin
Unique identifier
https://doi.org/10.4121/uuid:cb751e3e-3034-44a1-b0c1-b23128927dd8
Dataset updated
Jun 6, 2023
Dataset provided by
4TU.ResearchData
Authors
Anand Sawant
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This is data collected from 50 APIs on their usage among over 200,000 GitHub consumers.
Data from: Criminal Recidivism in a Large Cohort of Offenders Released from...
catalog.data.gov
s.cnmilf.com
Updated Mar 12, 2025
National Institute of Justice (2025). Criminal Recidivism in a Large Cohort of Offenders Released from Prison in Florida, 2004-2008 [Dataset]. https://catalog.data.gov/dataset/criminal-recidivism-in-a-large-cohort-of-offenders-released-from-prison-in-florida-2004-20-98557
Dataset updated
Mar 12, 2025
Dataset provided by
National Institute of Justice (http://nij.ojp.gov/)
Area covered
Florida
Description
The purpose of the study was to quantify the effect of the embrace of DNA technology on offender behavior. In particular, researchers examined whether an offender's knowledge that their DNA profile was entered into a database deterred them from offending in the future and if probative effects resulted from DNA sampling. The researchers coded information using criminal history records and data from Florida's DNA database, both of which are maintained by the Florida Department of Law Enforcement (FDLE), and also utilized court docket information acquired through the Florida Department of Corrections (FDOC) to create a dataset of 156,702 cases involving offenders released from the FDOC in the state of Florida between January 1996 and December 2004. The data contain a total of 50 variables. Major categories of variables include demographic variables regarding the offender, descriptive variables relating to the initial crime committed by the offender, and time-specific variables regarding cases of recidivism.
Big Stone Gap, VA Age Group Population Dataset: A Complete Breakdown of Big...
neilsberg.com
csv, json
Updated Feb 22, 2025
Big Stone Gap, VA Age Group Population Dataset: A Complete Breakdown of Big Stone Gap Age Demographics from 0 to 85 Years and Over, Distributed Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/big-stone-gap-va-population-by-age/
Available download formats: csv, json
Dataset updated
Feb 22, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Big Stone Gap, Virginia
Variables measured
Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
Measurement technique
The data presented in this dataset are derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we first analyzed and categorized the data for each age group. We divided ages between 0 and 85 into buckets of roughly 5 years each and aggregated all ages of 85 and over into a single group. For further information regarding these estimates, please reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the Big Stone Gap population distribution across 18 age groups. It lists the population in each age group along with its percentage of the total population of Big Stone Gap. The dataset can be used to understand the population distribution of Big Stone Gap by age. For example, using this dataset, we can identify the largest age group in Big Stone Gap.

Key observations

The largest age group in Big Stone Gap, VA was 30 to 34 years, with a population of 602 (11.59%), according to the ACS 2019-2023 5-Year Estimates. The smallest age group in Big Stone Gap, VA was 85 years and over, with a population of 57 (1.10%). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates

Age groups:

Under 5 years

5 to 9 years

10 to 14 years

15 to 19 years

20 to 24 years

25 to 29 years

30 to 34 years

35 to 39 years

40 to 44 years

45 to 49 years

50 to 54 years

55 to 59 years

60 to 64 years

65 to 69 years

70 to 74 years

75 to 79 years

80 to 84 years

85 years and over

Variables / Data Columns

Age Group: This column displays the age group in consideration

Population: This column shows the population of the specific age group in Big Stone Gap.

% of Total Population: This column displays the population of each age group as a percentage of the total population of Big Stone Gap. Please note that the percentages may not sum to exactly 100% due to rounding.

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for Big Stone Gap Population by Age.
Motamot: A Dataset for Revealing the Supremacy of Large Language Models over...
data.mendeley.com
Updated May 13, 2024
Cite
Fatema Tuj Johora Faria (2024). Motamot: A Dataset for Revealing the Supremacy of Large Language Models over Transformer Models in Bengali Political Sentiment Analysis [Dataset]. http://doi.org/10.17632/hdhnrrwdz2.1
Unique identifier
https://doi.org/10.17632/hdhnrrwdz2.1
Dataset updated
May 13, 2024
Authors
Fatema Tuj Johora Faria
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The "Motamot" dataset contains 7,058 data points labeled with Positive and Negative sentiments, tailored specifically for Political Sentiment Analysis in the Bengali language. The dataset comprises 4,132 instances labeled as Positive and 2,926 instances labeled as Negative.

Specifics of the Core Data:

Train: 5647, Test: 706, Validation: 705

Train:

Positive: 3306

Negative: 2341

Test:

Positive: 413

Negative: 293

Validation:

Positive: 413

Negative: 292
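
The per-split counts listed above can be cross-checked against the stated totals with a few lines of Python. This is only an illustrative sketch: the dictionary layout and the helper names `split_sizes` and `label_totals` are assumptions made for this example and are not part of the dataset or its distribution files.

```python
# Split counts copied from the dataset description; the structure is illustrative.
splits = {
    "train": {"positive": 3306, "negative": 2341},
    "test": {"positive": 413, "negative": 293},
    "validation": {"positive": 413, "negative": 292},
}


def split_sizes(splits):
    """Return the total number of instances in each split."""
    return {name: sum(counts.values()) for name, counts in splits.items()}


def label_totals(splits):
    """Return the number of instances per label, summed over all splits."""
    totals = {}
    for counts in splits.values():
        for label, n in counts.items():
            totals[label] = totals.get(label, 0) + n
    return totals


sizes = split_sizes(splits)
labels = label_totals(splits)
total = sum(sizes.values())

# Share of each split, and of the Positive class, as percentages of the whole.
split_share = {name: round(100 * n / total, 1) for name, n in sizes.items()}
positive_share = round(100 * labels["positive"] / total, 1)

print(sizes)
print(labels, total)
print(split_share, positive_share)
```

The sums reproduce the figures in the description (5647, 706, and 705 per split; 4,132 Positive and 2,926 Negative; 7,058 in total), which corresponds to roughly an 80/10/10 split.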
Bridge Data Dataset
paperswithcode.com
opendatalab.com
Updated Aug 20, 2024
Cite
Frederik Ebert; Yanlai Yang; Karl Schmeckpeper; Bernadette Bucher; Georgios Georgakis; Kostas Daniilidis; Chelsea Finn; Sergey Levine (2024). Bridge Data Dataset [Dataset]. https://paperswithcode.com/dataset/bridge-data
Dataset updated
Aug 20, 2024
Authors
Frederik Ebert; Yanlai Yang; Karl Schmeckpeper; Bernadette Bucher; Georgios Georgakis; Kostas Daniilidis; Chelsea Finn; Sergey Levine
Description
Bridge Data is a large multi-domain and multi-task dataset collected using a low-cost yet versatile 6-DoF WidowX250 robot arm. It contains 7,200 demonstrations of a robot performing 71 kitchen tasks across 10 environments with varying lighting, robot positions, and backgrounds. It can be used to boost the generalization of robotic skills and to study empirically how it can improve the learning of new tasks in new environments.
Data from: MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile...
produccioncientifica.ucm.es
produccioncientifica.ugr.es
+1more
Updated 2024
Cite
Banos, Oresti; Damas, Miguel; Goicoechea, Carmen; Perakakis, Pandelis; Pomares, Hector; Rodriguez-Leon, Ciro; Sanabria, Daniel; Villalonga, Claudia (2024). MobileWell400+: A Large-Scale Multivariate Longitudinal Mobile Dataset for Investigating Individual and Collective Well-Being [Dataset]. https://produccioncientifica.ucm.es/documentos/668fc499b9e7c03b01be2372
Dataset updated
2024
Authors
Banos, Oresti; Damas, Miguel; Goicoechea, Carmen; Perakakis, Pandelis; Pomares, Hector; Rodriguez-Leon, Ciro; Sanabria, Daniel; Villalonga, Claudia
Description
This study engaged 409 participants over a period spanning from July 10 to August 8, 2023, ensuring representation across various demographic factors: 221 females, 186 males, 2 non-binary, year of birth between 1951 and 2005, with varied annual incomes and from 15 Spanish regions. The MobileWell400+ dataset, openly accessible, encompasses a wide array of data collected via the participants' mobile phone, including demographic, emotional, social, behavioral, and well-being data. Methodologically, the project presents a promising avenue for uncovering new social, behavioral, and emotional indicators, supplementing existing literature. Notably, artificial intelligence is considered to be instrumental in analysing these data, discerning patterns, and forecasting trends, thereby advancing our comprehension of individual and population well-being. Ethical standards were upheld, with participants providing informed consent.

The following is a non-exhaustive list of collected data:

Data continuously collected through the participants' smartphone sensors: physical activity (resting, walking, driving, cycling, etc.), name of detected WiFi networks, connectivity type (WiFi, mobile, none), ambient light, ambient noise, and status of the device screen (on, off, locked, unlocked).

Data corresponding to an initial survey prompted via the smartphone, with information related to demographic data, effects and COVID vaccination, average hours of physical activity, and answers to a series of questions to measure mental health, many of them taken from internationally recognised psychological and well-being scales (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception.

Data corresponding to daily surveys prompted via the smartphone, where variables related to mood (valence, activation, energy and emotional events) and social interaction (quantity and quality) are measured.

Data corresponding to weekly surveys prompted via the smartphone, which ask about overall health, hours of physical activity per week, loneliness, and well-being.

Data corresponding to a final survey prompted via the smartphone, consisting of questions similar to the ones asked in the initial survey, namely psychological and well-being items (PANAS, PHQ, GAD, BRS and AAQ), social isolation (TILS) and economic inequality perception questions.

For a more detailed description of the study please refer to MobileWell400+StudyDescription.pdf.

For a more detailed description of the collected data, variables and data files please refer to MobileWell400+FilesDescription.pdf.
UCDP External Support Dataset
datacatalogue.cessda.eu
snd.se
Updated Aug 7, 2024
Cite
Högbladh, Stina; Pettersson, Therése; Themnér, Lotta (2024). UCDP External Support Dataset [Dataset]. https://datacatalogue.cessda.eu/detail?q=46183e962fc702dccd3b04f021bf6e0d1c5f02eb551cbf4e7287b295343cc12e&lang=en
Dataset updated
Aug 7, 2024
Dataset provided by
Department of Peace and Conflict Research, Uppsala University
Authors
Högbladh, Stina; Pettersson, Therése; Themnér, Lotta
Variables measured
Event/Process/Activity
Description
The UCDP, Uppsala Conflict Data Program, contains a large amount of data on organised violence, armed violence, and peacemaking. There is information from 1946 up to today, and the datasets are updated continuously. The data can be downloaded for free and are available in several different versions.

The UCDP External Support Data contains information on external support in intrastate conflicts, 1975-2010. It provides information on the kind of support, the external actor, and the specific year. The data are divided into two separate datasets which are analogous, i.e. they contain identical data structured in different ways to simplify various types of research, such as different types of statistical analyses:

One dataset provides data where the unit of analysis is a warring party-year, providing information on the existence, type, and provider of external support for all warring parties (actors) coded as active in UCDP data, on an annual basis. The dataset contains information for the time period 1975–2010. It involves 29 variables and 3606 individuals/objects.

One dataset provides data where the unit of analysis is the warring party-supporter-year, i.e. each row in the dataset contains information on the type of support that a warring party receives from a specific external party in a given year, using dummy variables for each category of support. The dataset contains information for the time period 1975–2010. It involves 30 variables and 6519 individuals/objects.
Data from: CLIVAR LE project
rda.ucar.edu
data.ucar.edu
+1more
Cite
CLIVAR LE project [Dataset]. https://rda.ucar.edu/lookfordata/datasets/?nb=y&b=topic&v=Atmosphere
Description
The CLIVAR Large Ensemble repository was built at NCAR and supported by the US CLIVAR WG on Large Ensembles. It features a set of CMORized variables from the following CMIP5 ... class Large Ensembles: CANESM2, CESM, CSIRO MK36, EC Earth, GFDL CM3, GFDL ESM2M, MPI, and OLENS McKinnon.
S2Looking Dataset
paperswithcode.com
opendatalab.com
Updated Jun 25, 2024
Cite
Li Shen; Yao Lu; Hao Chen; Hao Wei; Donghai Xie; Jiabao Yue; Rui Chen; Shouye Lv; Bitao Jiang (2024). S2Looking Dataset [Dataset]. https://paperswithcode.com/dataset/s2looking
Dataset updated
Jun 25, 2024
Authors
Li Shen; Yao Lu; Hao Chen; Hao Wei; Donghai Xie; Jiabao Yue; Rui Chen; Shouye Lv; Bitao Jiang
Description
S2Looking is a building change detection dataset that contains large-scale side-looking satellite images captured at varying off-nadir angles. The S2Looking dataset consists of 5,000 registered bitemporal image pairs (size of 1024*1024, 0.5 ~ 0.8 m/pixel) of rural areas throughout the world and more than 65,920 annotated change instances. We provide two label maps to separately indicate the newly built and demolished building regions for each sample in the dataset. We establish a benchmark task based on this dataset, i.e., identifying the pixel-level building changes in the bi-temporal images.
Assessment and Improvement of Statistical Tools for Comparative Proteomics...
figshare.com
acs.figshare.com
txt
Updated Jun 3, 2023
Cite
Veit Schwämmle; Ileana Rodríguez León; Ole Nørregaard Jensen (2023). Assessment and Improvement of Statistical Tools for Comparative Proteomics Analysis of Sparse Data Sets with Few Experimental Replicates [Dataset]. http://doi.org/10.1021/pr400045u.s002
Available download formats: txt
Unique identifier
https://doi.org/10.1021/pr400045u.s002
Dataset updated
Jun 3, 2023
Dataset provided by
ACS Publications
Authors
Veit Schwämmle; Ileana Rodríguez León; Ole Nørregaard Jensen
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Large-scale quantitative analyses of biological systems are often performed with few replicate experiments, leading to multiple nonidentical data sets due to missing values. For example, mass spectrometry driven proteomics experiments are frequently performed with few biological or technical replicates due to sample-scarcity or due to duty-cycle or sensitivity constraints, or limited capacity of the available instrumentation, leading to incomplete results where detection of significant feature changes becomes a challenge. This problem is further exacerbated for the detection of significant changes on the peptide level, for example, in phospho-proteomics experiments. In order to assess the extent of this problem and the implications for large-scale proteome analysis, we investigated and optimized the performance of three statistical approaches by using simulated and experimental data sets with varying numbers of missing values. We applied three tools, including standard t test, moderated t test, also known as limma, and rank products for the detection of significantly changing features in simulated and experimental proteomics data sets with missing values. The rank product method was improved to work with data sets containing missing values. Extensive analysis of simulated and experimental data sets revealed that the performance of the statistical analysis tools depended on simple properties of the data sets. High-confidence results were obtained by using the limma and rank products methods for analyses of triplicate data sets that exhibited more than 1000 features and more than 50% missing values. The maximum number of differentially represented features was identified by using limma and rank products methods in a complementary manner. We therefore recommend combined usage of these methods as a novel and optimal way to detect significantly changing features in these data sets. 
This approach is suitable for large quantitative data sets from stable isotope labeling and mass spectrometry experiments and should be applicable to large data sets of any type. An R script that implements the improved rank products algorithm and the combined analysis is available.
A sample medical dataset.
plos.figshare.com
xls
Updated May 31, 2023
Cite
Farough Ashkouti; Keyhan Khamforoosh (2023). A sample medical dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0285212.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0285212.t001
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Farough Ashkouti; Keyhan Khamforoosh
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Recently, big data and its applications have grown sharply in various fields such as IoT, bioinformatics, eCommerce, and social media. The huge volume of data has posed enormous challenges to the architecture, infrastructure, and computing capacity of IT systems. Therefore, the scientific and industrial community has a compelling need for large-scale and robust computing systems. Since one of the characteristics of big data is value, data should be published for analysts to extract useful patterns from them. However, data publishing may lead to the disclosure of individuals’ private information. Among the modern parallel computing platforms, Apache Spark is a fast, in-memory computing framework for large-scale data processing that provides high scalability by introducing resilient distributed datasets (RDDs). In terms of performance, due to in-memory computations, it is reported to be up to 100 times faster than Hadoop. Therefore, Apache Spark is one of the essential frameworks for implementing distributed methods for privacy-preserving big data publishing (PPBDP). This paper uses the RDD programming of Apache Spark to propose an efficient parallel implementation of a new computing model for big data anonymization. This computing model has three phases of in-memory computations to address the runtime, scalability, and performance of large-scale data anonymization. The model supports partition-based data clustering algorithms to preserve the λ-diversity privacy model by using transformations and actions on RDDs. Therefore, the authors have investigated a Spark-based implementation for preserving the λ-diversity privacy model with two designed distance functions, City block and Pearson. The results of the paper provide a comprehensive guideline allowing researchers to apply Apache Spark in their own research.
