Dataset Card for c-sharp-coding-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel, using the distilabel CLI:

`distilabel pipeline run --config "https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset/raw/main/pipeline.yaml"`

or explore the configuration:

`distilabel pipeline info --config…` See the full description on the dataset page: https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hi-C is one of the main methods for investigating spatial co-localisation of DNA in the nucleus. However, the raw sequencing data obtained from Hi-C experiments suffer from large biases and spurious contacts, making it difficult to identify true interactions. Existing methods use complex models to account for biases and do not provide a significance threshold for detecting interactions. Here we introduce a simple binomial probabilistic model that resolves complex biases and distinguishes between true and false interactions. The model corrects biases of known and unknown origin and yields a p-value for each interaction, providing a reliable threshold based on significance. We demonstrate this experimentally by testing the method against a random ligation dataset. Our method outperforms previous methods and provides a statistical framework for further data analysis, such as comparisons of Hi-C interactions between different conditions. GOTHiC is available as a BioConductor package (http://www.bioconductor.org/packages/release/bioc/html/GOTHiC.html).
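The core of the approach can be sketched in a few lines. The following is an illustrative simplification, not the GOTHiC code: the coverage-product null model and all names here are our assumptions.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the CDF."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

def interaction_pvalue(n_obs, n_total, rel_cov_i, rel_cov_j):
    """P-value that bin pair (i, j) has more read pairs than expected by chance.

    Under the null, each of the n_total read pairs falls on (i, j) with a
    probability proportional to the product of the bins' relative coverages,
    which absorbs multiplicative biases of known and unknown origin.
    """
    p_expected = rel_cov_i * rel_cov_j
    return binom_sf(n_obs, n_total, p_expected)

# 25 observed read pairs where ~10 are expected: a candidate true interaction.
p = interaction_pvalue(n_obs=25, n_total=1000, rel_cov_i=0.1, rel_cov_j=0.1)
```

Thresholding such p-values (after multiple-testing correction) is what turns raw contact counts into a set of significant interactions.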
These datasets contain C-V2X network communication and interoperability testing packet data collected using a network sniffer (Wireshark) in the Packet Capture (PCAP) format and converted into the Packet Description Markup Language (PDML) format. These datasets include three testcases: C-V2I, C-V2V, and C-V2X. These datasets can be used to display, analyze, and assess C-V2X compatibility and interoperability among commercial on-board units (OBUs) and road-side units (RSUs) based on IEEE 1609.2, IEEE 1609.3, and SAE J2735 standards.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
EDDEN stands for *E*valuation of *D*MRI *DEN*oising approaches. The data correspond to the publication: Manzano Patron, J.P., Moeller, S., Andersson, J.L.R., Yacoub, E., Sotiropoulos, S.N. Denoising Diffusion MRI: Considerations and implications for analysis. doi: https://doi.org/10.1101/2023.07.24.550348. Please cite it if you use this dataset.
Description of the dataset

RAW: complex data (magnitude and phase) were acquired for a single subject at different SNR/resolution regimes, under ~/EDDEN/sub-01/ses-XXX/dwi/:
- Dataset A (2mm)
- Dataset B (1p5mm)
- Dataset C (0p9mm)

Each dataset contains its own T1w-MPRAGE under ~/EDDEN/sub-01/ses-XXX/anat/. Each dataset was acquired on a different day, to minimise fatigue, but all repeats within a dataset were acquired in the same session. All acquisitions were obtained parallel to the anterior and posterior commissure line, covering the entire cerebrum.
DERIVATIVES

Here are the different denoised versions of the raw data for the different datasets, the pre-processed data for the raw, denoised and averaged versions, and the FA, MD and V1 outputs from the DTI model fitting (see the *Data pre-processing* section below).
- Denoised data:
  - NLM (NLM): Non-Local Means denoising applied to magnitude raw data.
  - MPPCA (|MPPCA|): Marchenko-Pastur PCA denoising applied to magnitude raw data.
  - MPPCA_complex (MPPCA*): Marchenko-Pastur PCA denoising applied to complex raw data.
  - NORDIC (NORDIC): NORDIC applied to complex raw data.
  - AVG_mag (|AVG|): the average of the multiple repeats in magnitude.
  - AVG_complex (AVG*): the average in the complex space of the multiple repeats.
- Masks: under ~/EDDEN/derivatives/ses-XXX/masks we can find different masks for each dataset:
  - GM_mask: Gray Matter mask.
  - WM_mask: White Matter mask.
  - CC_mask: Corpus Callosum mask.
  - CS_mask: Centrum Semiovale mask.
  - ventricles_mask: CSF ventricles mask.
  - nodif_brain_mask: Eroded brain mask.
Having the magnitude and complex data for each dataset, denoising was applied using different approaches prior to any pre-processing to minimise potential changes in statistical properties of the raw data due to interpolations (Veraart et al., 2016b). For denoising, we used the following four algorithms:
- **Denoising in the magnitude domain**: i) The Non-Local Means (**NLM**) (Buades et al., 2005) was applied as an exemplar of a simple non-linear filtering method adapted from traditional signal pre-processing. We used the default implementation in DIPY (Garyfallidis et al., 2014), where each dMRI volume is denoised independently. ii) The Marchenko-Pastur PCA (MPPCA) (denoted as **|MPPCA|** throughout the text) (Cordero-Grande et al., 2019; Veraart et al., 2016b), reflecting a commonly used approach that performs PCA over image patches and uses the MP theorem to identify noise components from the eigenspectrum. We used the default MrTrix3 implementation (Tournier et al., 2019).
- **Denoising in the complex domain**: i) MPPCA applied to complex data (rotated along the real axis), denoted as **MPPCA***. We applied the MrTrix3 implementation of the magnitude MPPCA to the complex data rotated to the real axis (we found that this approach was more stable in terms of handling phase images and achieved better denoising, compared to the MrTrix3 complex MPPCA implementation). ii) The **NORDIC** algorithm (Moeller et al., 2021a), which also relies on the MP theorem, but performs variance spatial normalisation prior to noise component identification and filtering, to ensure noise stationarity assumptions are fulfilled.
All data, both raw and their four denoised versions, underwent the same pre-processing steps for distortion and motion correction (Sotiropoulos et al., 2013b) using an in-house pipeline (Mohammadi-Nejad et al., 2019). To avoid confounds from potential misalignment in the distortion-corrected diffusion native space obtained from each approach, we chose to compute a single susceptibility-induced off-resonance fieldmap using the raw data for each of the Datasets A, B and C, and then use the corresponding fieldmap for all denoising approaches in each dataset, so that the reference native space stays the same for each of A, B and C. Note that differences between fieldmaps before and after denoising are small anyway, as the relatively high-SNR b = 0 s/mm2 images are used to estimate them. But these small differences can cause noticeable misalignments between methods and confounds when attempting quantitative comparisons, which we avoid here using our approach. Hence, for each of the Datasets A, B and C, the raw blip-reversed b = 0 s/mm2 images were used in FSL's topup to generate a fieldmap (Andersson and Skare, 2002). This was then used in individual runs of FSL's eddy for each approach (Andersson and Sotiropoulos, 2016), which applied the common fieldmap and performed corrections for eddy currents and subject motion in a single interpolation step. FSL's eddyqc (Bastiani et al., 2019) was used to generate quality control (QC) metrics, including SNR and angular CNR for each b-value. The same T1w image was used within each dataset. A linear transformation from the corrected native diffusion space to the T1w space was estimated using boundary-based registration (Greve and Fischl, 2009). The T1w image was skull-stripped and non-linearly registered to the MNI standard space, allowing further analysis. Masks of white and grey matter were obtained from the T1w image using FSL's FAST (Jenkinson et al., 2012) and were aligned to diffusion space.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides highly complex physical gene regulatory networks in young adult wild-type (WT) C.elegans worms. With a total of 239,001 regulatory interactions collected from 289 datasets, this dataset is a great resource for studying gene regulation and exploring how this gene activity contributes to organism function under varying bio-environmental conditions. Our collection of datasets contains 126 genes and 495 transcription factors, along with functional knockdown data that has been used to validate the physical gene regulatory networks present in the young adult C.elegans worms. Moreover, researchers and biologists can leverage this data to gain valuable insights on how various genotypes, ages and strains are associated with different perturbations in their biological features and ultimately uncover new discoveries about the network of relationships that exist between these genes inside animals. This comprehensive dataset will be essential for conducting research related to such topics as life development processes or age-related diseases - further enriching our understanding of life!
This guide will help you understand how to use this dataset of physical gene regulatory networks to research and analyze young adult C.elegans worms.
1. Understand the columns in the dataset: there are 239,001 regulatory interactions from 289 datasets, covering 126 genes and 495 transcription factors, each recorded with genotype, age, strain, perturbation type, data type, data source and source used. Comments and the regulator are also included in the columns for more information about each interaction.
2. Know your research goal: determine what you wish to discover when working with this dataset so that you can sort and explore the data efficiently. Knowing your goals for the analysis will help you decide which columns may provide valuable insights when filtering or sorting.
3. Analyse specific types of data: once your goals are established, focus on the data types relevant to achieving them (for example, transcription factor levels). Looking at these components together can offer insight into how regulation changes within a cell's environment, and which pathways may be activated or deactivated under different conditions.
4. Keep logs and documents up to date: after sorting or filtering on certain columns, make sure your logs and documents stay up to date and match any changes made during analysis, so that usage is not mixed up across documents or sessions over the project's lifespan. An organised record-keeping system helps ensure accuracy when dealing with large volumes of information over time.
We hope these tips help get you started exploring physical gene regulatory networks in C. elegans! If you have any questions, feel free to reach out via message – we would love to hear how things go after you put them into practice!
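As a concrete starting point for the steps above, here is a minimal pandas sketch. The column names and values are invented for illustration and may not match the dataset's exact schema:

```python
import pandas as pd

# Toy rows shaped like the described columns (regulator, gene, strain, data type).
interactions = pd.DataFrame({
    "regulator": ["tf-1", "tf-1", "tf-2"],
    "gene": ["gene-a", "gene-b", "gene-a"],
    "strain": ["N2", "N2", "CB4856"],
    "data_type": ["ChIP", "Y1H", "ChIP"],
})

# Focus on one data type and count unique targets per transcription factor.
chip = interactions[interactions["data_type"] == "ChIP"]
targets_per_tf = chip.groupby("regulator")["gene"].nunique()
```

The same filter-then-group pattern extends to genotype, age, or perturbation-type columns.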
- Training machine-learning algorithms to develop automated approaches in predicting gene expression levels of individual regulatory networks.
- Using this dataset alongside data from RNA-seq experiments to investigate how genetic mutations, environmental changes, and other factors can affect gene regulation across C.elegans populations.
- Exploring the correlation between transcription factor binding sites and gene expression levels to predict potential target genes for a given transcription factor.
If you use this dataset in your research, please credit the original authors.

License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication, No Copyright. You can copy, m...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Details of the Hi-C datasets from 70 cell lines/tissues used for the analysis are provided in the table. Hi-C contact data were used to find the target genes of the autoimmune disease-associated SNPs.
- Land use data for 2001-2100 from PLUM1.3 (Parsimonious Land Use Model version 1.3) coupled with a global energy-economics model, downscaled to 0.5×0.5 degree gridcells, for the five scenarios SSP1-SSP5 (reference and mitigation strategies) as described in detail in Engström et al. (2017).
- Total terrestrial biosphere carbon (kg/m2) for 2001-2100 from simulations using the vegetation model LPJ-GUESS at 0.5×0.5 degree resolution for the 10 SSP-SPA scenarios using 1-3 RCPs for four different climate models as described in Engström et al. (2017).

Ref.: Engström K., Lindeskog M., Olin S., Hassler J., and Smith B. (2017) Impacts of climate mitigation strategies in the energy sector on global land use and carbon balance.
Community Data License Agreement (CDLA-Sharing-1.0): https://cdla.io/sharing-1-0/
Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale
This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.
- chroma_tensor: a JSON-safe string representation of a PyTorch tensor with shape [1, 12, T], where:
  - 12 = the 12 pitch classes (C, C#, D, ... B)
  - T = time steps
- scale_index: an integer label from 0–23 identifying the scale the sample belongs to

This dataset is ideal for:
- Training deep learning models (CNNs, MLPs) to classify musical scales
- Exploring pitch-class distributions in Western tonal music
- Prototyping models for music key detection, chord prediction, or tonal analysis
- Teaching or demonstrating chromagram-based ML workflows
| Index | Scale |
|---|---|
| 0 | C major |
| 1 | C# major |
| ... | ... |
| 11 | B major |
| 12 | C minor |
| ... | ... |
| 23 | B minor |
Chroma tensors are of shape [1, 12, T], where:
- 1 is the channel dimension (for CNN input)
- 12 represents the 12 pitch classes (C through B)
- T is the number of time frames
```python
import torch
import pandas as pd
from tqdm import tqdm

df = pd.read_csv("/content/scale_dataset.csv")

# Reconstruct chroma tensors from their string representation.
# Note: eval() works here because each cell stores a Python-style list;
# only run this on data you trust.
X = [torch.tensor(eval(row)).reshape(1, 12, -1) for row in tqdm(df["chroma_tensor"])]
y = df["scale_index"].tolist()
```
Alternatively, you can load the chroma tensors and target scale indices directly from the .pt file.
```python
import torch

data = torch.load("chroma_tensors.pt")
X_pt = data["X"]  # list of [1, 12, 302] tensors
y_pt = data["y"]  # list of scale indices
```
The samples were generated with music21 and FluidSynth, and chroma features were extracted with librosa.feature.chroma_stft.

| Column | Type | Description |
|---|---|---|
| chroma_tensor | str | Flattened 1D chroma tensor [1×12×T] |
| scale_index | int | Label from 0 to 23 |

Tensors share a fixed time dimension (T) for easy batching.
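For a sense of what the 0–23 labels encode, here is a stdlib-only sketch that classifies a chroma array by matching its time-averaged pitch-class profile against binary scale templates. The natural-minor templates and the index order are assumptions based on the table above; this is not how the dataset labels were produced:

```python
# Binary pitch-class templates for the 24 scales, in the dataset's assumed
# index order: 0-11 = C..B major, 12-23 = C..B minor (natural minor assumed).
MAJOR = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
MINOR = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]

def _roll(template, k):
    """Rotate a 12-element template so its root moves up k semitones."""
    return template[-k:] + template[:-k] if k else template[:]

TEMPLATES = [_roll(MAJOR, k) for k in range(12)] + [_roll(MINOR, k) for k in range(12)]

def classify_scale(chroma):
    """Map a [1, 12, T] nested-list chroma to a 0-23 index by template match."""
    rows = chroma[0]                                   # 12 lists of T frames
    profile = [sum(r) / len(r) for r in rows]          # time-averaged energy per pitch class
    scores = [sum(t * p for t, p in zip(tpl, profile)) for tpl in TEMPLATES]
    return scores.index(max(scores))

# Example: a chroma whose active pitch classes are exactly the G major scale.
g_major = {7, 9, 11, 0, 2, 4, 6}
chroma = [[[1.0] * 4 if pc in g_major else [0.0] * 4 for pc in range(12)]]
idx = classify_scale(chroma)
```

A learned model (CNN or MLP over the same [1, 12, T] input) would replace the fixed templates with trained weights.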
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
DATASET C 0510 is a dataset for instance segmentation tasks - it contains Crosswalk annotations for 676 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
C Project is a dataset for object detection tasks - it contains Tooth annotations for 713 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
There are five different files for this dataset:
1. A dataset listing the reported functional uses of chemicals (FUse)
2. All 729 ToxPrint descriptors obtained from ChemoTyper for chemicals in FUse
3. All EPI Suite properties obtained for chemicals in FUse
4. The confusion matrix values, similarity thresholds, and bioactivity index for each model
5. The functional use prediction, bioactivity index, and prediction classification (poor prediction, functional substitute, candidate alternative) for each Tox21 chemical

This dataset is associated with the following publication: Phillips, K., J. Wambaugh, C. Grulke, K. Dionisio, and K. Isaacs. High-throughput screening of chemicals as functional substitutes using structure-based classification models. Green Chemistry, Royal Society of Chemistry, Cambridge, UK, 19: 1063-1074, (2017).
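The confusion matrix values in file 4 can be reduced to familiar summary statistics. A minimal sketch follows; the metric set is our choice for illustration, not necessarily what the publication reports:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Hypothetical counts for one functional-use model.
m = classification_metrics(tp=8, fp=2, fn=2, tn=88)
```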
The NOAA Coastal Change Analysis Program (C-CAP) produces national standardized land cover and change products for the coastal regions of the U.S. C-CAP products inventory coastal intertidal areas, wetlands, and adjacent uplands with the goal of monitoring changes in these habitats, on a one-to-five year repeat cycle. The timeframe for this metadata is reported as 1985 - 2010-Era, but the actual dates of the Landsat imagery used to create the land cover may have been acquired a few years before or after each era. These maps are developed utilizing Landsat Thematic Mapper imagery, and can be used to track changes in the landscape through time. This trend information gives important feedback to managers on the success or failure of management policies and programs, and aids in developing a scientific understanding of the Earth system and its response to natural and human-induced changes. This understanding allows for the prediction of impacts due to these changes and the assessment of their cumulative effects, helping coastal resource managers make more informed regional decisions. NOAA C-CAP is a contributing member of the Multi-Resolution Land Characteristics consortium, and C-CAP products are included as the coastal expression of land cover within the National Land Cover Database.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context: we share a large database containing electroencephalographic signals from 87 human participants, with more than 20,800 trials in total, representing about 70 hours of recording. It was collected during brain-computer interface (BCI) experiments and organized into 3 datasets (A, B, and C) that were all recorded following the same protocol: right- and left-hand motor imagery (MI) tasks during a single-day session. It includes the performance of the associated BCI users, detailed information about the users' demographics, personality and cognitive profile, and the experimental instructions and codes (executed in the open-source platform OpenViBE). Such a database could prove useful for various studies, including but not limited to: 1) studying the relationships between BCI users' profiles and their BCI performances, 2) studying how EEG signal properties vary for different users' profiles and MI tasks, 3) using the large number of participants to design cross-user BCI machine learning algorithms, or 4) incorporating users' profile information into the design of EEG signal classification algorithms. Sixty participants (Dataset A) performed the first experiment, designed to investigate the impact of experimenters' and users' gender on MI-BCI user training outcomes, i.e., users' performance and experience (Pillette et al.). Twenty-one participants (Dataset B) performed the second one, designed to examine the relationship between users' online performance (i.e., classification accuracy) and the characteristics of the chosen user-specific Most Discriminant Frequency Band (MDFB) (Benaroch et al.). The only difference between the two experiments lies in the algorithm used to select the MDFB. Dataset C contains 6 additional participants who completed one of the two experiments described above.
Physiological signals were measured using a g.USBAmp (g.tec, Austria), sampled at 512 Hz, and processed online using OpenViBE 2.1.0 (Dataset A) and OpenViBE 2.2.0 (Dataset B). For Dataset C, participants C83 and C85 were recorded with OpenViBE 2.1.0 and the remaining 4 participants with OpenViBE 2.2.0. Experiments were recorded at Inria Bordeaux Sud-Ouest, France.

Duration: each participant's folder contains approximately 48 minutes of EEG recording, i.e., six 7-minute runs and a 6-minute baseline.

Documents:
- Instructions: checklist read by experimenters during the experiments.
- Questionnaires: the Mental Rotation test used, and the translation of 4 questionnaires, notably the Demographic and Social information, the Pre- and Post-session questionnaires, and the Index of Learning Styles (English and French versions).
- Performance: the online OpenViBE BCI classification performances obtained by each participant for each run, as well as answers to all questionnaires.
- Scenarios/scripts: set of OpenViBE scenarios used to perform each step of the MI-BCI protocol, e.g., acquire training data, calibrate the classifier, or run the online MI-BCI.

Database (raw signals):
- Dataset A: N=60 participants
- Dataset B: N=21 participants
- Dataset C: N=6 participants
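As an example of use case 2 (how EEG signal properties vary across users and MI tasks), a common first feature is band power at the datasets' 512 Hz sampling rate. The sketch below uses a plain FFT estimator on synthetic data; the function and names are our own, not part of the released scripts:

```python
import numpy as np

FS = 512  # sampling rate used in these datasets (Hz)

def band_power(signal, low, high, fs=FS):
    """Mean spectral power of `signal` within the [low, high] Hz band."""
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    band = (freqs >= low) & (freqs <= high)
    return psd[band].mean()

# Mu-band (8-12 Hz) power, a classic motor-imagery feature, on a 10 Hz tone.
t = np.arange(FS * 2) / FS
x = np.sin(2 * np.pi * 10 * t)
mu = band_power(x, 8, 12)
```

In practice one would use a windowed estimator (e.g. Welch's method) on electrodes over the motor cortex.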
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Capstone C Final Dataset 1 is a dataset for object detection tasks - it contains Cars annotations for 4,564 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Provisional database: the data you have secured from the U.S. Geological Survey (USGS) database identified as the preliminary Coastal Grain Size Portal (C-GRASP) dataset, Version 1, January 2022, have not received USGS approval and as such are provisional and subject to revision. The data are released on the condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from their authorized or unauthorized use.
Version 1 (January 2022) of the Coastal Grain Size Portal (C-GRASP) database. This is a preliminary internal deliverable for the National Oceanography Partnership Program (NOPP) Task 1 / USGS Gesch team and project partners only.
The primary purpose of this Provisional data release is to provide National Oceanography Partnership Program (NOPP) project partners with programmatic access to this preliminary version of the Coastal Grain Size Portal (C-GRASP) database for internal project use. These data are preliminary or provisional and are subject to revision. They are being provided to meet the need for timely best science. The data have not received final approval by the U.S. Geological Survey (USGS) and are provided on the condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from the authorized or unauthorized use of the data.
This preliminary data release contains various files that list grain size information collated from secondary data already in the public domain, in the form of public datasets, or in published literature.
Where possible, we have indicated the source, location, and sampling methods used to obtain these data. Where not possible to establish these facts, those fields have been left empty.
More information on our methods, data sources, and data processing and analysis codes can be found on our GitHub page.
The dataset consists of one zipped file, Source_Files.zip, and four comma-separated value (CSV) files:
- dataset_10kmcoast.csv: all data found to be within 10 km of the Natural Earth coastline polyline.
- Data_EstimatedOnshore.csv: all the data from dataset_10kmcoast.csv that lie within the Natural Earth United States polygon.
- Data_VerifiedOnshore.csv: all data that could be verified onshore from sampling method, note, or location type data.
- Data_Post2012_VerifiedOnshore.csv: all the data from Data_VerifiedOnshore.csv dated after 2012.
The files each have the following fields (fields with no data are left blank):
- 'ID': row ID (integer)
- 'Sample_ID': identifier linking to the raw data source
- 'Sample_Type_Code': code of the sample ID
- 'Project': raw data source project identifier
- 'dataset': raw dataset major identifier
- 'Date': date, where specified, to whatever precision is specified
- 'Location_Type': where specified, code indicating the type of location information
- 'latitude': latitude in decimal degrees
- 'longitude': longitude in decimal degrees
- 'Contact': where specified, raw data originator
- 'num_orig_dists': number of unique grain size distributions
- 'Measured_Distributions': number of measured grain size distributions
- 'Grainsize': grain size, where reported without further specification
- 'Mean': mean grain size in mm
- 'Median': median grain size in mm
- 'Wentworth': Wentworth name (one of ['Clay', 'CoarseSand', 'CoarseSilt', 'Cobble', 'FineSand', 'FineSilt', 'Granule', 'MediumSand', 'MediumSilt', 'Pebble', 'VeryCoarseSand', 'VeryFineSand', 'VeryFineSilt'])
- 'Kurtosis': kurtosis value (non-dimensional)
- 'Kurtosis_Class': kurtosis category
- 'Skewness': skewness value (non-dimensional)
- 'Skewness_Class': skewness category
- 'Std': standard deviation of grain sizes
- 'Sorting': sorting category
- 'd5': grain size distribution 5th percentile
- 'd10': grain size distribution 10th percentile
- 'd16': grain size distribution 16th percentile
- 'd25': grain size distribution 25th percentile
- 'd30': grain size distribution 30th percentile
- 'd50': grain size distribution 50th percentile
- 'd65': grain size distribution 65th percentile
- 'd75': grain size distribution 75th percentile
- 'd84': grain size distribution 84th percentile
- 'd90': grain size distribution 90th percentile
- 'd95': grain size distribution 95th percentile
- 'Notes': notes; these can be informative and substantial, do not disregard them
Source_Files.zip contains 11 comma-separated value files (bicms.csv, boem.csv, clark.csv, dbseabed.csv, ecstdb.csv, mass.csv, mcfall.csv, rossi.csv, sandsnap.csv, sbell.csv, ussb.csv), which contain the raw datasets collated and extracted from their native formats into CSV format.
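To illustrate working with these fields, a small pandas sketch on synthetic rows (the values are invented; only the field names follow the list above):

```python
import pandas as pd

# Synthetic rows shaped like the C-GRASP fields described above.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Wentworth": ["FineSand", "FineSand", "MediumSand", "Pebble"],
    "Median": [0.18, 0.21, 0.35, 8.0],  # median grain size in mm
    "Date": ["2013-05-01", "2011-07-15", "2015-02-20", "2014-09-09"],
})

# Median grain size per Wentworth class, and a post-2012 subset analogous
# to the Data_Post2012_VerifiedOnshore.csv filtering step.
per_class = df.groupby("Wentworth")["Median"].median()
post_2012 = df[pd.to_datetime(df["Date"]).dt.year > 2012]
```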
MS Excel Spreadsheet, 16.5 KB

MS Excel Spreadsheet, 18 KB
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains 10,000 unique C++ programming prompts along with their corresponding code responses, designed specifically for training and evaluating natural language generation models such as Transformers. **Each row in the CSV contains:**
id: A unique identifier for each record.
prompt: A C++ programming instruction or task, phrased in natural language.
response: The corresponding C++ source code fulfilling the prompt.
The prompts include a wide range of programming concepts, such as:
Basic arithmetic operations
Loops and conditionals
Class and object creation
Recursion and algorithm design
Template functions and data structures
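A representative record might look like the following (an invented example; actual rows will differ):

```python
# Hypothetical row mirroring the id / prompt / response schema described above.
record = {
    "id": 42,
    "prompt": "Write a C++ function that returns the factorial of n.",
    "response": (
        "#include <cstdint>\n"
        "uint64_t factorial(unsigned n) {\n"
        "    return n <= 1 ? 1 : n * factorial(n - 1);\n"
        "}\n"
    ),
}
```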
This dataset is ideal for:
Fine-tuning code generation models (e.g., GPT-style models)
Creating educational tools or auto-code assistants
Exploring zero-shot/few-shot learning in code generation
The following code can be used to complete all #TODO programs in the dataset:
```python
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from tqdm import tqdm

df = pd.read_csv("/Path/CPP_Dataset_MujtabaAhmed.csv")

model_name = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()  # use .cpu() if no GPU

def complete_code(prompt):
    input_text = prompt.strip() + " "
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_length=512,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    return decoded.replace(prompt.strip(), "").strip()

completed_responses = []
for i, row in tqdm(df.iterrows(), total=len(df), desc="Processing"):
    prompt, response = row["prompt"], row["response"]
    if "TODO" in response:
        generated = complete_code(prompt + " " + response.split("TODO")[0])
        response_filled = response.replace("TODO", generated)
    else:
        response_filled = response
    completed_responses.append(response_filled)

df["response"] = completed_responses
df.to_csv("CPP_Dataset_Completed.csv", index=False)
print("✅ Completed CSV saved as 'CPP_Dataset_Completed.csv'")
```
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship adjusted hm-index, citations to papers in different authorship positions and a composite indicator (c-score). Separate data are shown for career-long and, separately, for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given, and data on retracted papers (based on the Retraction Watch database) as well as citations to/from retracted papers have been added in the most recent iteration. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2023 and single recent year data pertain to citations received during calendar year 2023. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2024 snapshot from Scopus, updated to the end of citation year 2023. This work uses Scopus data. Calculations were performed using all Scopus author profiles as of August 1, 2024. If an author is not on the list it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work. PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US.
They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/) so that the correct data can be used in any future annual updates of the citation indicator databases. The c-score focuses on impact (citations) rather than productivity (number of publications) and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see attached file on FREQUENTLY ASKED QUESTIONS. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden manifesto: https://www.nature.com/articles/520429a
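To make two of the metrics named above concrete, here is a minimal sketch of the h-index and the co-authorship-adjusted hm-index (in the sense of fractional paper counts, where each paper contributes 1/(number of authors) to the rank). The citation counts below are hypothetical illustration data, not values from the database.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations."""
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

def hm_index(papers):
    """papers: list of (citations, n_authors) tuples.

    Each paper contributes 1/n_authors to an 'effective rank'; hm is the
    largest effective rank still covered by that paper's citation count.
    """
    hm = 0.0
    effective_rank = 0.0
    for citations, n_authors in sorted(papers, key=lambda p: p[0], reverse=True):
        effective_rank += 1.0 / n_authors
        if citations >= effective_rank:
            hm = effective_rank
        else:
            break
    return hm

# Hypothetical author with five papers: (citations, number of authors)
papers = [(25, 1), (18, 3), (12, 2), (7, 4), (3, 2)]
print(h_index([c for c, _ in papers]))        # -> 4
print(round(hm_index(papers), 2))             # -> 2.58
```

Fractional counting is what makes hm lower than h for heavily co-authored output, which is the point of including it alongside the plain h-index.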
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
AutoNaVIT is a meticulously developed dataset designed to accelerate research in autonomous navigation, semantic scene understanding, and object segmentation through deep learning. This release includes only the annotation labels in XML format, aligned with high-resolution frames extracted from a controlled driving sequence at Vellore Institute of Technology – Chennai Campus (VIT-C). The corresponding images will be included in Version 2 of the dataset.
Class Annotations
The dataset features carefully annotated bounding boxes for the following three classes, each essential to real-time navigation and path planning in autonomous vehicles:
Kerb – 1,377 instances
Obstacle – 258 instances
Path – 532 instances
All annotations were produced using Roboflow with human-verified precision, ensuring consistent, high-quality data that supports robust model development for urban and semi-urban scenarios.
Data Capture Specifications
The source video was captured using a Sony IMX890 sensor under stable daylight lighting. Capture parameters:
Sensor Size: 1/1.56", 50 MP
Lens: 6P optical configuration
Aperture: ƒ/1.8
Focal Length: 24mm equivalent
Pixel Size: 1.0 µm
Features: Optical Image Stabilization (OIS), PDAF autofocus
Video Duration: 4 minutes 11 seconds
Frame Rate: 2 FPS
Total Annotated Frames: 504
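A quick consistency check of the specifications above: a 4 min 11 s clip sampled at 2 FPS yields about 502 frames, in line with the 504 annotated frames reported (the small difference is plausibly due to inclusive endpoints or rounding during frame extraction).

```python
# Expected frame count from the stated duration and sampling rate
duration_s = 4 * 60 + 11   # 4 min 11 s = 251 seconds
fps = 2                    # extraction rate, frames per second
print(duration_s * fps)    # -> 502, close to the 504 annotated frames
```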
Format Compatibility and Model Support
AutoNaVIT annotations are provided in Pascal VOC-compatible XML format, making them directly usable with any model or framework that supports the Pascal VOC standard.
As XML is a structured, extensible format, these annotations can be easily adapted for use with additional object detection frameworks that support XML-based label schemas.
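As an illustration of that adaptability, the sketch below reads one Pascal VOC-style XML annotation using only the Python standard library. The tag layout follows the VOC convention; the embedded XML (file name, box coordinates) is illustrative, not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Illustrative Pascal VOC annotation for a single frame (hypothetical values)
VOC_XML = """<annotation>
  <filename>frame_000123.jpg</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>Kerb</name>
    <bndbox><xmin>34</xmin><ymin>540</ymin><xmax>410</xmax><ymax>720</ymax></bndbox>
  </object>
</annotation>"""

def parse_voc(xml_text):
    """Return a list of {label, xmin, ymin, xmax, ymax} dicts from VOC XML."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append({
            "label": obj.findtext("name"),
            "xmin": int(bb.findtext("xmin")),
            "ymin": int(bb.findtext("ymin")),
            "xmax": int(bb.findtext("xmax")),
            "ymax": int(bb.findtext("ymax")),
        })
    return boxes

print(parse_voc(VOC_XML))
```

From this intermediate representation, converting to other box formats (e.g. normalized center/width/height) is a few lines of arithmetic.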
Benchmark Results
To assess dataset utility, a YOLOv8 segmentation model was trained on the full dataset (including images). The model achieved the following results:
Mean Average Precision (mAP): 96.5%
Precision: 92.2%
Recall: 94.4%
These metrics demonstrate the dataset’s effectiveness in training models for autonomous vehicle perception and obstacle detection.
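For readers comparing against these numbers, the sketch below shows how precision and recall relate to raw detection counts (true positives, false positives, missed objects) at a fixed IoU threshold. The counts are illustrative, chosen only so the ratios land near the reported percentages; they are not the actual evaluation tallies.

```python
def precision_recall(tp, fp, fn):
    """Detection metrics from matched-box counts at a given IoU threshold.

    precision = TP / (TP + FP): fraction of predicted boxes that are correct.
    recall    = TP / (TP + FN): fraction of ground-truth boxes that are found.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Illustrative counts: 944 correct detections, 80 spurious boxes, 56 misses
p, r = precision_recall(tp=944, fp=80, fn=56)
print(f"precision={p:.3f} recall={r:.3f}")  # close to the reported 92.2% / 94.4%
```

mAP is not a single ratio like these: it averages precision over recall levels (the area under the precision-recall curve) and then over classes, which is why it can exceed both single-threshold values.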
Disclaimer and Attribution Requirement
By downloading or using this dataset, users agree to the terms outlined in the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0):
This dataset is available solely for academic and non-commercial research purposes.
Proper attribution must be provided as follows: “Dataset courtesy of Vellore Institute of Technology – Chennai Campus.” This citation must appear in all research papers, presentations, or any work derived from this dataset.
Redistribution, public hosting, commercial use, or modification is prohibited without prior written permission from VIT-C.
Use of this dataset implies acceptance of these terms. All rights not explicitly granted are retained by VIT-C.
Datasets used in ORD-025118: Using a Gene Expression Biomarker to Identify DNA Damage-Inducing Agents in Microarray Profiles. This dataset is associated with the following publication: Corton, C., A. Williams, and C. Yauk. Using a Gene Expression Biomarker to Identify DNA Damage-Inducing Agents in Microarray Profiles. ENVIRONMENTAL AND MOLECULAR MUTAGENESIS. John Wiley & Sons, Inc, Hoboken, NJ, USA, 59(9): 772-784, (2018).
Dataset Card for c-sharp-coding-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it with the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset.