100+ datasets found
  1. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]

    Description

    The annotation file consists of the following fields:

    filename: Name of the corresponding file (video or audio file)
    dataset: Source dataset associated with the data point
    captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    captions_auditory: A list of captions related to the auditory content of the video
    tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
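
    As a quick illustration of working with the fields described above, the annotations can be filtered like any other pandas DataFrame (a minimal sketch, assuming a local copy of annotation_train.parquet and the column names listed above):

    import pandas as pd

    # Assumes the annotation file and columns described above
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')

    # Example: keep only AudioSet clips that have visual captions
    audioset = df[(df['dataset'] == 'AudioSet') & df['captions_visual'].notna()]
    print(audioset[['filename', 'captions_auditory', 'tags']].head())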

  2. original : CIFAR 100

    • kaggle.com
    zip
    Updated Dec 28, 2024
    Cite
    Shashwat Pandey (2024). original : CIFAR 100 [Dataset]. https://www.kaggle.com/datasets/shashwat90/original-cifar-100
    Explore at:
    zip (168517945 bytes). Available download formats
    Dataset updated
    Dec 28, 2024
    Authors
    Shashwat Pandey
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

    The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.

    Baseline results

    You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.

    Other results

    Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website.

    Dataset layout

    Python / Matlab versions

    I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.

    The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:

    def unpickle(file):
        import cPickle
        with open(file, 'rb') as fo:
            dict = cPickle.load(fo)
        return dict

    And a python3 version:

    def unpickle(file):
        import pickle
        with open(file, 'rb') as fo:
            dict = pickle.load(fo, encoding='bytes')
        return dict

    Loaded in this way, each of the batch files contains a dictionary with the following elements:

    data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.

    labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
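
    To get images out of a loaded batch, the rows can be reshaped with numpy (a sketch, not from the original page, assuming the python3 unpickle above, where dictionary keys are byte strings):

    import numpy as np

    batch = unpickle('data_batch_1')
    data = batch[b'data']                # (10000, 3072) uint8
    labels = np.array(batch[b'labels'])  # (10000,)

    # Rows are channel-major: 1024 red, then 1024 green, then 1024 blue values
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)  # (10000, 32, 32, 3)
    print(images.shape, labels[:10])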

    The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:

    label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.

    Binary version

    The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:

    <1 x label><3072 x pixel>
    ...
    <1 x label><3072 x pixel>

    In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.

    Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.

    There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
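
    The binary batches can be parsed directly with numpy (a sketch, not from the original page, assuming the 3073-byte-per-record layout described above):

    import numpy as np

    raw = np.fromfile('data_batch_1.bin', dtype=np.uint8).reshape(-1, 3073)
    labels = raw[:, 0]                                                # 1 label byte per record
    images = raw[:, 1:].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)  # 3072 pixel bytes per record

    with open('batches.meta.txt') as f:
        label_names = [line.strip() for line in f if line.strip()]
    print(images.shape, label_names[labels[0]])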

    The CIFAR-100 dataset

    This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...

  3. #PraCegoVer dataset

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 19, 2023
    Cite
    Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
    Explore at:
    Dataset updated
    Jan 19, 2023
    Dataset provided by
    Institute of Computing, University of Campinas
    Authors
    Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila
    Description

    Automatically describing images using natural sentences is essential for the inclusion of visually impaired people on the Internet. Although there are many datasets in the literature, most of them contain only English captions; datasets with captions in other languages are scarce.

    PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

    PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

    Dataset Structure

    The #PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json comprises a list of JSON objects with the attributes:

    user: anonymized user that made the post;

    filename: image file name;

    raw_caption: raw caption;

    caption: clean caption;

    date: post date.

    Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.
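
    A minimal way to inspect the annotations (a sketch, assuming dataset.json is a plain JSON list of objects with the attributes listed above):

    import json
    import pandas as pd

    with open('dataset.json', encoding='utf-8') as f:
        records = json.load(f)

    df = pd.DataFrame(records)
    print(df[['user', 'filename', 'caption', 'date']].head())

    # Each row points to one image in the extracted images directory
    first = df.iloc[0]
    print(first['caption'], '->', first['filename'])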

    Download Instructions

    If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

    cat images.tar.gz.part* > images.tar.gz
    tar -xzvf images.tar.gz

    Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

    python download_dataset.py --access_token=

  4. Job Dataset

    • kaggle.com
    zip
    Updated Sep 17, 2023
    Cite
    Ravender Singh Rana (2023). Job Dataset [Dataset]. https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset
    Explore at:
    zip (479575920 bytes). Available download formats
    Dataset updated
    Sep 17, 2023
    Authors
    Ravender Singh Rana
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Dataset

    This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.

    Descriptions for each of the columns in the dataset:

    1. Job Id: A unique identifier for each job posting.
    2. Experience: The required or preferred years of experience for the job.
    3. Qualifications: The educational qualifications needed for the job.
    4. Salary Range: The range of salaries or compensation offered for the position.
    5. Location: The city or area where the job is located.
    6. Country: The country where the job is located.
    7. Latitude: The latitude coordinate of the job location.
    8. Longitude: The longitude coordinate of the job location.
    9. Work Type: The type of employment (e.g., full-time, part-time, contract).
    10. Company Size: The approximate size or scale of the hiring company.
    11. Job Posting Date: The date when the job posting was made public.
    12. Preference: Special preferences or requirements for applicants (e.g., Only Male or Only Female, or Both)
    13. Contact Person: The name of the contact person or recruiter for the job.
    14. Contact: Contact information for job inquiries.
    15. Job Title: The job title or position being advertised.
    16. Role: The role or category of the job (e.g., software developer, marketing manager).
    17. Job Portal: The platform or website where the job was posted.
    18. Job Description: A detailed description of the job responsibilities and requirements.
    19. Benefits: Information about benefits offered with the job (e.g., health insurance, retirement plans).
    20. Skills: The skills or qualifications required for the job.
    21. Responsibilities: Specific responsibilities and duties associated with the job.
    22. Company Name: The name of the hiring company.
    23. Company Profile: A brief overview of the company's background and mission.
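
    Once downloaded, the table can be loaded with pandas for a first look (a sketch; the CSV filename below is a placeholder, and the column names follow the list above):

    import pandas as pd

    jobs = pd.read_csv('job_descriptions.csv')

    # Example: count postings per work type and inspect a few salary ranges
    print(jobs['Work Type'].value_counts())
    print(jobs[['Job Title', 'Salary Range', 'Country']].head())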

    Potential Use Cases:

    • Building predictive models to forecast job market trends.
    • Enhancing job recommendation systems for job seekers.
    • Developing NLP models for resume parsing and job matching.
    • Analyzing regional job market disparities and opportunities.
    • Exploring salary prediction models for various job roles.

    Acknowledgements:

    We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.

    Note:

    Please note that the examples provided are fictional and for illustrative purposes. You can tailor the descriptions and examples to match the specifics of your dataset. It is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at: rrana157@gmail.com

  5. python

    • huggingface.co
    Updated Feb 20, 2025
    Cite
    changamonika (2025). python [Dataset]. https://huggingface.co/datasets/changamonika/python
    Explore at:
    Dataset updated
    Feb 20, 2025
    Authors
    changamonika
    Description

    pip install numpy pandas scikit-learn matplotlib seaborn

    # Sample dataset: Age, Salary, and whether they purchased (1 = Yes, 0 = No)
    import pandas as pd

    data = {
        'Age': [22, 25, 47, 52, 46, 56, 24, 27, 32, 37],
        'Salary': [20000, 25000, 50000, 60000, 58000, 70000, 22000, 27000, 32000, 37000],
        'Purchased': [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
    }
    df = pd.DataFrame(data)

    # Split dataset into Features (X) and Target (y)
    X = df[['Age', 'Salary']]  # Independent variables
    y = df['Purchased']  # … See the full description on the dataset page: https://huggingface.co/datasets/changamonika/python.
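
    The snippet is cut off on the listing page. A typical continuation for this style of walkthrough (a sketch only, not taken from the dataset card) would hold out part of the data and fit a simple scikit-learn classifier:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hypothetical continuation: hold out 20% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LogisticRegression()
    model.fit(X_train, y_train)
    print('accuracy:', accuracy_score(y_test, model.predict(X_test)))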

  6. Datasets for manuscript "A data engineering framework for chemical flow...

    • catalog.data.gov
    • gimi9.com
    Updated Nov 7, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-a-data-engineering-framework-for-chemical-flow-analysis-of-industr
    Explore at:
    Dataset updated
    Nov 7, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The EPA GitHub repository PAU4Chem, as described in its README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flow transfers, estimating releases, and identifying potential occupational exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step, and the file Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories.

    The EPA GitHub repository PAU_case_study, as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly-available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset was built using the repository PAU4Chem.

    Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly-available sources: EPA, NOAA, OSHA, and the Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA).

    All GitHub repositories describe the Python libraries required for running their code, how to use them, the output files obtained after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).

  7. Python Systems for Empirical Analysis

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Matteo Orrù (2020). Python Systems for Empirical Analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_268468
    Explore at:
    Dataset updated
    Jan 24, 2020
    Authors
    Matteo Orrù
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reference

    Studies that use the data (in any form) are required to include the following reference:

    @inproceedings{Orru2015,
      abstract  = {The aim of this paper is to present a dataset of metrics associated to the first release of a curated collection of Python software systems. We describe the dataset along with the adopted criteria and the issues we faced while building such corpus. This dataset can enhance the reliability of empirical studies, enabling their reproducibility, reducing their cost, and it can foster further research on Python software.},
      author    = {Orrú, Matteo and Tempero, Ewan and Marchesi, Michele and Tonelli, Roberto and Destefanis, Giuseppe},
      booktitle = {Submitted to PROMISE '15},
      keywords  = {Python, Empirical Studies, Curated Code Collection},
      title     = {A Curated Benchmark Collection of Python Systems for Empirical Studies on Software Engineering},
      year      = {2015}
    }

    About the Data

    Overview

    This paper presents a dataset of metrics taken from a curated collection of 51 popular Python software systems.

    The dataset reports 41 metrics of different categories: volume/size, complexity and object-oriented metrics. These metrics are computed both at file and class level. We provide metrics for every file and class of each system, as well as global metrics (computed on the entire system). Moreover, we provide 14 meta-data fields for each system.

    Paper Abstract

    The aim of this paper is to present a dataset of metrics associated to the first release of a curated collection of Python software systems. We describe the dataset along with the adopted criteria and the issues we faced while building such corpus. This dataset can enhance the reliability of empirical studies, enabling their reproducibility, reducing their cost, and it can foster further research on Python software.

  8. Naukri data Scraped

    • kaggle.com
    zip
    Updated Jul 13, 2022
    Cite
    Tadaka SuryaTeja (2022). Naukri data Scraped [Dataset]. https://www.kaggle.com/datasets/tadakasuryateja/python-jobs-and-salaries
    Explore at:
    zip (926089 bytes). Available download formats
    Dataset updated
    Jul 13, 2022
    Authors
    Tadaka SuryaTeja
    Description

    Scraped data from Naukri. This dataset was extracted on 13 July 2022; the salaries may vary in the future.

    The data consists of 8 Columns: 1) Job title 2) Company Name 3) Experience 4) Salary 5) Location 6) Key skills 7) About Company 8) Job Description

    The data was extracted from a Naukri search using the keywords: Python, 3 years of experience, and Hyderabad location.

    You can extract the required data by running the naukri_scrape.py file which is available in the below github link.

    Source code available at: https://github.com/TadakaSuryaTeja/LinkedIn_Automation/blob/main/naukri_scrape.py

  9. Magicoder-Evol-Instruct-110K-python

    • huggingface.co
    Updated Nov 17, 2024
    Cite
    pxy (2024). Magicoder-Evol-Instruct-110K-python [Dataset]. https://huggingface.co/datasets/pxyyy/Magicoder-Evol-Instruct-110K-python
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 17, 2024
    Authors
    pxy
    Description

    Dataset Card for "Magicoder-Evol-Instruct-110K-python"

    from datasets import load_dataset

    # Load your dataset
    dataset = load_dataset("pxyyy/Magicoder-Evol-Instruct-110K", split="train")  # Replace with your dataset and split

    # Define a filter function
    def contains_python(entry):
        for c in entry["messages"]:
            if "python" in c["content"].lower():
                return True
        return False
        # return "python" in entry["messages"].lower()  # Replace 'column_name' with the column to search

    … See the full description on the dataset page: https://huggingface.co/datasets/pxyyy/Magicoder-Evol-Instruct-110K-python.
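
    The card is truncated at this point. Presumably the function is then passed to the datasets library's filter method, along these lines (a sketch, not from the card):

    python_only = dataset.filter(contains_python)
    print(len(dataset), '->', len(python_only))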

  10. US Travel Check-Ins - Analysis

    • kaggle.com
    zip
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). US Travel Check-Ins - Analysis [Dataset]. https://www.kaggle.com/datasets/thedevastator/us-travel-check-ins-analysis
    Explore at:
    zip (2350764 bytes). Available download formats
    Dataset updated
    Feb 11, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    US Travel Check-Ins - Analysis

    In-Depth Study of Location, Date, Temperature, USIndex, and Crime Rates


    About this dataset

    This comprehensive dataset offers an in-depth exploration into US travel check-ins from Instagram. It includes detailed data scraped from Instagram, such as the location of each check-in, the USIndex for each state, average temperature for each state per month, and crime rate per state. In addition to location and time information, this dataset also provides latitude and longitude coordinates for every entry. This extensive collection of data is invaluable for those interested in studying various aspects of movement within the United States. With detailed insights on factors like climate conditions and economic health of a region at a given point in time, this dataset can help uncover fascinating trends regarding how travelers choose their destinations and how they experience their journeys around the country


    How to use the dataset

    This Kaggle dataset - US Travel Check-Ins Analysis - provides valuable insights for travel researchers, marketers and businesses in the travel industry. It contains check-in locations, a USIndex rating (economic health of each state), average temperature, and crime rate per state. The latitude and longitude of each check-in are also provided, adding geographic context to help you visualize the data.

    This guide will show you how to use this dataset for your research or business venture.

    Step 1: Prepare your data. First and foremost, cleanse your data before analyzing it. Depending on the kind of analysis you plan to run (e.g., a time series analysis), select only the applicable columns and exclude anything irrelevant, such as date or season fields you do not need. Keep variable formatting consistent within each column (elevation is a good example). You can double-check that everything is formatted correctly by running a quick summary on selected columns, e.g. df['var'].describe() in Python, which reports a column's statistical makeup including mean values, quartile ranges, etc.
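
    For instance (a sketch; the CSV filename and column name below are placeholders for this dataset):

    import pandas as pd

    checkins = pd.read_csv('us_travel_checkins.csv')
    print(checkins['Temperature'].describe())  # mean, quartiles, min/max, etc.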

    Step 2: Explore and analyze your data graphically. Once the data has been prepped properly, start visualizing it to gain better insight into any trends or patterns, especially when compared with other information sources such as weather forecasts or nationwide trend indicators. Grafana dashboards are a feasible solution when multiple datasets need to be compared, but depending on the type of graphs or charts used, Excel worksheets offer great customization flexibility along with various export formats (.csv, .jpeg, .pdf). Plotting markers onto map applications like the Google Maps API adds geographical awareness that is useful when analyzing location-dependent variables, an advantage gained simply by leveraging existing software alongside publicly available APIs rather than relying on manual inspection.

    Step 3: Interpretation and hypothesis testing. After generating informative graphical interpretations from the exploratory visualizations, the next step is to test hypotheses based on the correlations observed between variables, for example whether distribution trends in certain regions point to areas where logistical processes could yield higher success ratios and greater customer satisfaction.

    Research Ideas

    • Travel trends analysis: Using this dataset, researchers could track which areas of the US are popular destinations based on travel check-ins and spot any interesting trends or correlations in terms of geography, seasonal changes, economic health or crime rates.
    • Predictive Modeling: By using various features from this dataset such as average temperature, US Index and crime rate, predictors could be developed to suggest how safe an area would feel to a tourist based on their current location and other predetermined variables they choose to input into the model.
    • Trip Planning Tool: The dataset can also be used to develop a tool that quickly allows travelers to plan trips according to their preferences in terms of duration and budget as well a...
  11. Data from: Computational 3D resolution enhancement for optical coherence...

    • data.niaid.nih.gov
    Updated Jul 12, 2024
    Cite
    Jos de Wit; George-Othon Glentis; Jeroen Kalkman (2024). Computational 3D resolution enhancement for optical coherence tomography with a narrowband visible light source [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7870794
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Delft University of Technology, Delft, The Netherlands
    University of Peloponnese, Tripolis, Greece
    Authors
    Jos de Wit; George-Othon Glentis; Jeroen Kalkman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the code and data underlying the publication "Computational 3D resolution enhancement for optical coherence tomography with a narrowband visible light source" in Biomedical Optics Express 14, 3532-3554 (2023) (doi.org/10.1364/BOE.487345).

    The reader is free to use the scripts and data in this repository, as long as the manuscript is correctly cited in their work. For further questions, please contact the corresponding author.

    Description of the code and datasets

    Table 1 describes all the Matlab and Python scripts in this repository. Table 2 describes the datasets. The input datasets are the phase-corrected datasets, as the raw data is large in size and phase correction using a coverslip as reference is rather straightforward. Processed datasets are also added to the repository to allow running only a limited number of scripts, or to obtain, for example, the aberration-corrected data without the need to use Python. Note that the simulation input data (input_simulations_pointscatters_SLDshape_98zf_noise75.mat) is generated with random noise, so if it is overwritten the results may vary slightly. The aberration correction is also done with random apertures, so the processed aberration-corrected data (exp_pointscat_image_MIAA_ISAM_CAO.mat and exp_leaf_image_MIAA_ISAM_CAO.mat) will also change slightly if the aberration correction script is run anew. The current processed datasets are used as the basis for the figures in the publication. For details on the implementation we refer to the publication.

    Table 1: The Matlab and Python scripts with their description

    Script name | Description
    MIAA_ISAM_processing.m | Performs the DFT, RFIAA and MIAA processing of the phase-corrected data that can be loaded from the datasets. Afterwards it also applies ISAM on the DFT and MIAA data and plots the results in a figure (via the scripts plot_figure3, plot_figure5 and plot_simulationdatafigure).
    resolution_analysis_figure4.m | Loads the data from the point scatterers (absolute amplitude data), seeks the point scatterers and fits them to obtain the resolution data. Finally it plots figure 4 of the publication.
    fiaa_oct_c1.m, oct_iaa_c1.m, rec_fiaa_oct_c1.m, rfiaa_oct_c1.m | These four functions are used to apply fast IAA and MIAA. See script MIAA_ISAM_processing.m for their usage.
    viridis.m, morgenstemning.m | These scripts define the colormaps for the figures.
    plot_figure3.m, plot_figure5.m, plot_simulationdatafigure.m | These scripts plot figures 3 and 5 and a figure with simulation data. They are executed at the end of script MIAA_ISAM_processing.m.
    computational_adaptive_optics_script.py | Python script that applies computational adaptive optics to obtain the data for figure 6 of the manuscript.
    zernike_functions2.py | Python script that gives the values and Cartesian derivatives of the Zernike polynomials.
    figure6_ComputationalAdaptiveOptics.m | Script that loads the CAO data that was saved in Python, analyzes the resolution, and plots figure 6.
    OCTsimulations_3D_script2.py | Python script that simulates OCT data, adds noise and saves it as a .mat file for use in the Matlab scripts above.
    OCTsimulations2.py | Module that contains a Python class that can be used to simulate 3D OCT datasets based on a Gaussian beam.
    Matlab toolbox DIPimage 2.9.zip | DIPimage is used in the scripts. The toolbox can be downloaded online or this zip can be used.
    
    Table 2: The datasets in this Zenodo repository

    Name | Description
    input_leafdisc_phasecorrected.mat | Phase corrected input image of the leaf disc (used in figure 5).
    input_TiO2gelatin_004_phasecorrected.mat | Phase corrected input image of the TiO2 in gelatin sample.
    input_simulations_pointscatters_SLDshape_98zf_noise75 | Input simulation data that, once processed, is used in figure 4.
    exp_pointscat_image_DFT.mat, exp_pointscat_image_DFT_ISAM.mat, exp_pointscat_image_RFIAA.mat, exp_pointscat_image_MIAA_ISAM.mat, exp_pointscat_image_MIAA_ISAM_CAO.mat | Processed experimental amplitude data for the TiO2 point scattering sample with respectively DFT, DFT+ISAM, RFIAA, MIAA+ISAM and MIAA+ISAM+CAO. These datasets are used for fitting in figure 4 (except for CAO), and MIAA_ISAM and MIAA_ISAM_CAO are used for figure 6.
    simu_pointscat_image_DFT.mat, simu_pointscat_image_RFIAA.mat, simu_pointscat_image_DFT_ISAM.mat, simu_pointscat_image_MIAA_ISAM.mat | Processed amplitude data from the simulation dataset, which is used in the script for figure 4 for the resolution analysis.
    exp_leaf_image_MIAA_ISAM.mat, exp_leaf_image_MIAA_ISAM_CAO.mat | Processed amplitude data from the leaf sample, with and without aberration correction, which is used to produce figure 6.
    exp_leaf_zernike_coefficients_CAO_normal_wmaf.mat, exp_pointscat_zernike_coefficients_CAO_normal_wmaf.mat | Estimated Zernike coefficients and the weighted moving average of them that is used for the computational aberration correction. Some of this data is plotted in Figure 6 of the manuscript.
    input_zernike_modes.mat | The reference Zernike modes corresponding to the data that is loaded to give the modes the proper name.
    exp_pointscat_MIAA_ISAM_complex.mat, exp_leaf_MIAA_ISAM_complex | Complex MIAA+ISAM processed data that is used as input for the computational aberration correction.
    
  12. Dataset metadata of known Dataverse installations

    • search.dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 22, 2023
    + more versions
    Cite
    Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gautier, Julian
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation).csv
    │   ├── basic.csv
    │   ├── contributor(citation).csv
    │   ├── ...
    │   └── topic_classification(citation).csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2022.10.02_17.11.19.zip
    │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
    │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
    │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
    │   ├── ...
    │   ├── metadatablocks_v5.6
    │   ├── astrophysics_v5.6.json
    │   ├── biomedical_v5.6.json
    │   ├── citation_v5.6.json
    │   ├── ...
    │   ├── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
    │   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
    │   ├── Arca_Dados_2022.10.02_17.44.35.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    ├── dataset_pids_from_most_known_dataverse_installations.csv
    ├── licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory.

    One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files.

    The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
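
    For reference, the two-column API-token CSV that the script expects can be created like this (a sketch; the hostnames, tokens and output filename below are placeholders):

    import pandas as pd

    tokens = pd.DataFrame({
        'hostname': ['https://dataverse.example.edu', 'https://demo.dataverse.org'],
        'apikey': ['xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx', 'yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy'],
    })
    tokens.to_csv('api_tokens.csv', index=False)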

  13. math-python-reasoning-dataset

    • huggingface.co
    Updated Feb 8, 2025
    + more versions
    Cite
    Sara Han Díaz (2025). math-python-reasoning-dataset [Dataset]. https://huggingface.co/datasets/sdiazlor/math-python-reasoning-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 8, 2025
    Authors
    Sara Han Díaz
    Description

    Dataset Card for my-distiset-3c1699f5

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/my-distiset-3c1699f5/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/sdiazlor/math-python-reasoning-dataset.

  14. MASCDB, a database of images, descriptors and microphysical properties of...

    • data.niaid.nih.gov
    • springerprofessional.de
    • +2more
    Updated Jul 5, 2023
    Cite
    Grazioli, Jacopo; Ghiggi, Gionata; Berne, Alexis (2023). MASCDB, a database of images, descriptors and microphysical properties of individual snowflakes in free fall [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5578920
    Explore at:
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    EPFL-ENAC-IIE-LTE
    Authors
    Grazioli, Jacopo; Ghiggi, Gionata; Berne, Alexis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset overview

    This dataset provides data and images of snowflakes in free fall, collected with a Multi-Angle Snowflake Camera (MASC). The dataset includes, for each recorded snowflake:

    A triplet of gray-scale images corresponding to the three cameras of the MASC

    A large quantity of geometrical, textural descriptors and the pre-compiled output of published retrieval algorithms as well as basic environmental information at the location and time of each measurement.

    The pre-computed descriptors and retrievals are available either individually for each camera view or, for some of them, as descriptors of the triplet as a whole. A non-exhaustive list of precomputed quantities includes, for example:

    Textural and geometrical descriptors as in Praz et al 2017

    Hydrometeor classification, riming degree estimation, melting identification, as in Praz et al 2017

    Blowing snow identification, as in Schaer et al 2020

    Mass, volume, gyration estimation, as in Leinonen et al 2021

    Data format and structure

    The dataset is divided into four .parquet files (for scalar descriptors) and a Zarr database (for the images). A detailed description of the data content and of the data records is available here.

    Supporting code

    A python-based API is available to manipulate, display and organize the data of our dataset. It can be found on GitHub. See also the code documentation on ReadTheDocs.

    Download notes

    All files available here for download should be stored in the same folder, if the python-based API is used

    MASCdb.zarr.zip must be unzipped after download
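
    As a rough illustration, the parquet files and the unzipped Zarr store can also be opened directly with standard readers (a sketch; the parquet filename below is a placeholder, and the GitHub API mentioned above is the recommended route):

    import pandas as pd
    import zarr

    descriptors = pd.read_parquet('MASCdb_triplet.parquet')  # placeholder for one of the four .parquet files
    print(descriptors.columns[:10])

    store = zarr.open('MASCdb.zarr', mode='r')  # images live in the unzipped Zarr store
    print(store.tree())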

    Field campaigns

    A list of the campaigns included in the dataset, with a minimal description, is given in the following table:

    Campaign_name | Information | Shielded / Not shielded
    (DFIR = Double Fence Intercomparison Reference)

    APRES3-2016 & APRES3-2017 | Instrument installed in Antarctica in the context of the APRES3 project. See for example Genthon et al, 2018 or Grazioli et al 2017 | Not shielded
    Davos-2015 | Instrument installed in the Swiss Alps within the context of SPICE (Solid Precipitation InterComparison Experiment) | Shielded (DFIR)
    Davos-2019 | Instrument installed in the Swiss Alps within the context of RACLETS (Role of Aerosols and CLouds Enhanced by Topography on Snow) | Not shielded
    ICEGENESIS-2021 | Instrument installed in the Swiss Jura in a MeteoSwiss ground measurement site, within the context of ICE-GENESIS. See for example Billault-Roux et al, 2023 | Not shielded
    ICEPOP-2018 | Instrument installed in Korea, in the context of ICEPOP. See for example Gehring et al 2021. | Shielded (DFIR)
    Jura-2019 & Jura-2023 | Instrument installed in the Swiss Jura within a MeteoSwiss measurement site | Not shielded
    Norway-2016 | Instrument installed in Norway during the High-Latitude Measurement of Snowfall (HiLaMS). See for example Cooper et al, 2022. | Not shielded
    PLATO-2019 | Instrument installed in the "Davis" Antarctic base during the PLATO field campaign | Not shielded
    POPE-2020 | Instrument installed in the "Princess Elizabeth Antarctica" base during the POPE campaign. See for example Ferrone et al, 2023. | Not shielded
    Remoray-2022 | Instrument installed in the French Jura. | Not shielded
    Valais-2016 | Instrument installed in the Swiss Alps in a ski resort. | Not shielded

    Version

    1.0 - Two new campaigns ("Jura-2023", "Norway-2016") added. Added references and list of campaigns.

    0.3 - a new campaign is added to the dataset ("Remoray-2022")

    0.2 - rename of variables. Variable precision (digits) standardized

    0.1 - first upload

  15. Data from: Satellite remote sensing dataset of Sentinel-2 for phenology...

    • zenodo.org
    • producciocientifica.uv.es
    • +1more
    txt
    Updated Apr 28, 2023
    Cite
    Dessislava Ganeva; Lukas Graf Valentin; Egor Prikaziuk; Gerbrand Koren; Enrico Tomelleri; Jochem Verrelst; Katja Berger; Santiago Belda; Zhanzhang Cai; Cláudio Silva Figueira (2023). Satellite remote sensing dataset of Sentinel-2 for phenology metrics extraction from sites in Bulgaria and France [Dataset]. http://doi.org/10.5281/zenodo.7825727
    Explore at:
    txt. Available download formats
    Dataset updated
    Apr 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dessislava Ganeva; Lukas Graf Valentin; Egor Prikaziuk; Gerbrand Koren; Enrico Tomelleri; Jochem Verrelst; Katja Berger; Santiago Belda; Zhanzhang Cai; Cláudio Silva Figueira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Site Description:

    In this dataset, there are seventeen production crop fields in Bulgaria where winter rapeseed and wheat were grown and two research fields in France where winter wheat – rapeseed – barley – sunflower and winter wheat – irrigated maize crop rotation is used. The full description of those fields is in the database "In-situ crop phenology dataset from sites in Bulgaria and France" (doi.org/10.5281/zenodo.7875440).

    Methodology and Data Description:

    Remote sensing data is extracted from Sentinel-2 tiles 35TNJ for Bulgarian sites and 31TCJ for French sites on the day of the overpass since September 2015 for Sentinel-2 derived vegetation indices and since October 2016 for HR-VPP products. To suppress spectral mixing effects at the parcel boundaries, as highlighted by Meier et al., 2020, the values from all datasets were subgrouped per field and then aggregated to a single median value for further analysis.

    Sentinel-2 data was downloaded for all test sites from CREODIAS (https://creodias.eu/) in L2A processing level using a maximum scene-wide cloudy cover threshold of 75%. Scenes before 2017 were available in L1C processing level only. Scenes in L1C processing level were corrected for atmospheric effects after downloading using Sen2Cor (v2.9) with default settings. This was the same version used for the L2A scenes obtained intermediately from CREODIAS.

    Next, the data was extracted from the Sentinel-2 scenes for each field parcel where only SCL classes 4 (vegetation) and 5 (bare soil) pixels were kept. We resampled the 20m band B8A to match the spatial resolution of the green and red band (10m) using nearest neighbor interpolation. The entire image processing chain was carried out using the open-source Python Earth Observation Data Analysis Library (EOdal) (Graf et al., 2022).

    Apart from the widely used Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI), we included two recently proposed indices that were reported to have a higher correlation with photosynthesis and drought response of vegetation: These were the Near-Infrared Reflection of Vegetation (NIRv) (Badgley et al., 2017) and Kernel NDVI (kNDVI) (Camps-Valls et al., 2021). We calculated the vegetation indices in two different ways:

    First, we used B08 as near-infrared (NIR) band which comes in a native spatial resolution of 10 m. B08 (central wavelength 833 nm) has a relatively coarse spectral resolution with a bandwidth of 106 nm.

    Second, we used B8A which is available at 20 m spatial resolution. B8A differs from B08 in its central wavelength (864 nm) and has a narrower bandwidth (21 nm or 22 nm in the case of Sentinel-2A and 2B, respectively) compared to B08.
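
    For concreteness, the per-pixel index computations follow standard formulas (a sketch, not the processing code used for this dataset, assuming reflectance values scaled to [0, 1] with red = B04 and NIR = B08 or resampled B8A):

    import numpy as np

    def vegetation_indices(red, nir):
        ndvi = (nir - red) / (nir + red)
        nirv = ndvi * nir           # NIRv, Badgley et al. (2017)
        kndvi = np.tanh(ndvi ** 2)  # simplified kNDVI, Camps-Valls et al. (2021)
        return ndvi, nirv, kndvi

    red = np.array([0.05, 0.08, 0.12])
    nir = np.array([0.40, 0.35, 0.30])
    print(vegetation_indices(red, nir))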

    The High Resolution Vegetation Phenology and Productivity (HR-VPP) dataset from Copernicus Land Monitoring Service (CLMS) has three 10-m set products of Sentinel-2: vegetation indices, vegetation phenology and productivity parameters and seasonal trajectories (Tian et al., 2021). Both vegetation indices, Normalized Vegetation Index (NDVI) and Plant Phenology (PPI) and plant parameters, Fraction of Absorbed Photosynthetic Active Radiation (FAPAR) and Leaf Area Index (LAI) were computed for the time of Sentinel-2 overpass by the data provider.

    NDVI is computed directly from B04 and B08 and PPI is computed using Difference Vegetation Index (DVI = B08 - B04) and its seasonal maximum value per pixel. FAPAR and LAI are retrieved from B03 and B04 and B08 with neural network training on PROSAIL model simulations. The dataset has a quality flag product (QFLAG2) which is a 16-bit that extends the scene classification band (SCL) of the Sentinel-2 Level-2 products. A “medium” filter was used to mask out QFLAG2 values from 2 to 1022, leaving land pixels (bit 1) within or outside cloud proximity (bits 11 and 13) or cloud shadow proximity (bits 12 and 14).

    The HR-VPP daily raw vegetation indices products are described in detail in the user manual (Smets et al., 2022) and the computations details of PPI are given by Jin and Eklundh (2014). Seasonal trajectories refer to the 10-daily smoothed time-series of PPI used for vegetation phenology and productivity parameters retrieval with TIMESAT (Jönsson and Eklundh 2002, 2004).

    HR-VPP data was downloaded through the WEkEO Copernicus Data and Information Access Services (DIAS) system with a Python 3.8.10 harmonized data access (HDA) API 0.2.1. Zonal statistics [’min’, ’max’, ’mean’, ’median’, ’count’, ’std’, ’majority’] were computed on non-masked pixel values within field boundaries with rasterstats Python package 0.17.00.

    The Start of season date (SOSD), end of season date (EOSD) and length of seasons (LENGTH) were extracted from the annual Vegetation Phenology and Productivity Parameters (VPP) dataset as an additional source for comparison. These data are a product of the Vegetation Phenology and Productivity Parameters, see (https://land.copernicus.eu/pan-european/biophysical-parameters/high-resolution-vegetation-phenology-and-productivity/vegetation-phenology-and-productivity) for detailed information.

    File Description:

    4 datasets:

    1_senseco_data_S2_B08_Bulgaria_France; 1_senseco_data_S2_B8A_Bulgaria_France; 1_senseco_data_HR_VPP_Bulgaria_France; 1_senseco_data_phenology_VPP_Bulgaria_France

    3 metadata:

    2_senseco_metadata_S2_B08_B8A_Bulgaria_France; 2_senseco_metadata_HR_VPP_Bulgaria_France; 2_senseco_metadata_phenology_VPP_Bulgaria_France

    The dataset files “1_senseco_data_S2_B8_Bulgaria_France” and “1_senseco_data_S2_B8A_Bulgaria_France” concerns all vegetation indices (EVI, NDVI, kNDVI, NIRv) data values and related information, and metadata file “2_senseco_metadata_S2_B08_B8A_Bulgaria_France” describes all the existing variables. Both “1_senseco_data_S2_B8_Bulgaria_France” and “1_senseco_data_S2_B8A_Bulgaria_France” have the same column variable names and for that reason, they share the same metadata file “2_senseco_metadata_S2_B08_B8A_Bulgaria_France”.

    The dataset file “1_senseco_data_HR_VPP_Bulgaria_France” concerns vegetation indices (NDVI, PPI) and plant parameters (LAI, FAPAR) data values and related information, and metadata file “2_senseco_metadata_HRVPP_Bulgaria_France” describes all the existing variables.

    The dataset file “1_senseco_data_phenology_VPP_Bulgaria_France” concerns the vegetation phenology and productivity parameters (LENGTH, SOSD, EOSD) values and related information, and metadata file “2_senseco_metadata_VPP_Bulgaria_France” describes all the existing variables.

    Bibliography

    G. Badgley, C.B. Field, J.A. Berry, Canopy near-infrared reflectance and terrestrial photosynthesis, Sci. Adv. 3 (2017) e1602244. https://doi.org/10.1126/sciadv.1602244.

    G. Camps-Valls, M. Campos-Taberner, Á. Moreno-Martínez, S. Walther, G. Duveiller, A. Cescatti, M.D. Mahecha, J. Muñoz-Marí, F.J. García-Haro, L. Guanter, M. Jung, J.A. Gamon, M. Reichstein, S.W. Running, A unified vegetation index for quantifying the terrestrial biosphere, Sci. Adv. 7 (2021) eabc7447. https://doi.org/10.1126/sciadv.abc7447.

    L.V. Graf, G. Perich, H. Aasen, EOdal: An open-source Python package for large-scale agroecological research using Earth Observation and gridded environmental data, Comput. Electron. Agric. 203 (2022) 107487. https://doi.org/10.1016/j.compag.2022.107487.

    H. Jin, L. Eklundh, A physically based vegetation index for improved monitoring of plant phenology, Remote Sens. Environ. 152 (2014) 512–525. https://doi.org/10.1016/j.rse.2014.07.010.

    P. Jonsson, L. Eklundh, Seasonality extraction by function fitting to time-series of satellite sensor data, IEEE Trans. Geosci. Remote Sens. 40 (2002) 1824–1832. https://doi.org/10.1109/TGRS.2002.802519.

    P. Jönsson, L. Eklundh, TIMESAT—a program for analyzing time-series of satellite sensor data, Comput. Geosci. 30 (2004) 833–845. https://doi.org/10.1016/j.cageo.2004.05.006.

    J. Meier, W. Mauser, T. Hank, H. Bach, Assessments on the impact of high-resolution-sensor pixel sizes for common agricultural policy and smart farming services in European regions, Comput. Electron. Agric. 169 (2020) 105205. https://doi.org/10.1016/j.compag.2019.105205.

    B. Smets, Z. Cai, L. Eklund, F. Tian, K. Bonte, R. Van Hoost, R. Van De Kerchove, S. Adriaensen, B. De Roo, T. Jacobs, F. Camacho, J. Sánchez-Zapero, S. Else, H. Scheifinger, K. Hufkens, P. Jönsson, HR-VPP Product User Manual Vegetation Indices, 2022.

    F. Tian, Z. Cai, H. Jin, K. Hufkens, H. Scheifinger, T. Tagesson, B. Smets, R. Van Hoolst, K. Bonte, E. Ivits, X. Tong, J. Ardö, L. Eklundh, Calibrating vegetation phenology from Sentinel-2 using eddy covariance, PhenoCam, and PEP725 networks across Europe, Remote Sens. Environ. 260 (2021) 112456. https://doi.org/10.1016/j.rse.2021.112456.

  16. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors; Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    csv, txt (available download formats)
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index: Global index linking code blocks to markup_data.csv.
    • kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id: Position of the code block within the notebook.
    • code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id: Identifier for the Kaggle Jupyter notebook.
    • kaggle_score: Performance metric of the notebook.
    • kaggle_comments: Number of comments on the notebook.
    • kaggle_upvotes: Number of upvotes the notebook received.
    • kernel_link: URL to the notebook.
    • comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name: Name of the Kaggle competition.
    • description: Overview of the competition task.
    • data_type: Type of data used in the competition.
    • comp_type: Classification of the competition.
    • subtitle: Short description of the task.
    • EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    • data_sources: Links to datasets used.
    • metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block: Machine learning code block.
    • too_long: Flag indicating whether the block spans multiple semantic types.
    • marks: Confidence level of the annotation.
    • graph_vertex_id: ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
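
    To make the linkage concrete, here is a minimal pandas sketch of these joins (a sketch only: the file paths are assumed to be in the working directory, and only the join keys and column names come from the tables above):

    ```python
    import pandas as pd

    # Load the three main tables (paths are assumptions).
    code_blocks = pd.read_csv("code_blocks.csv")
    kernels_meta = pd.read_csv("kernels_meta.csv")
    competitions_meta = pd.read_csv("competitions_meta.csv")

    # code_blocks -> kernels_meta via kernel_id,
    # kernels_meta -> competitions_meta via comp_name.
    blocks = (
        code_blocks
        .merge(kernels_meta, on="kernel_id", how="inner")
        .merge(competitions_meta, on="comp_name", how="left")
    )
    print(blocks[["code_blocks_index", "kernel_id", "comp_name"]].head())
    ```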

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the help of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  17. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    application/x-sqlite3 (available download formats)
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

    isogramy (int): The order of isogramy, e.g. "2" is a second order isogram
    length (int): The length of the word in letters
    word (text): The actual word/isogram in ASCII
    source_pos (text): The Part of Speech tag from the original corpus
    count (int): Token count (total number of occurrences)
    vol_count (int): Volume count (number of different sources which contain the word)
    count_per_million (int): Token count per million words
    vol_count_as_percent (int): Volume count as percentage of the total number of volumes
    is_palindrome (bool): Whether the word is a palindrome (1) or not (0)
    is_tautonym (bool): Whether the word is a tautonym (1) or not (0)
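
    As a quick illustration, here is a minimal sketch of reading one of the ".csv" files with pandas (the separator, missing header row, and column names follow the description above; the filename matches the naming convention used in section 2.4 below):

    ```python
    import pandas as pd

    columns = [
        "isogramy", "length", "word", "source_pos", "count", "vol_count",
        "count_per_million", "vol_count_as_percent", "is_palindrome", "is_tautonym",
    ]
    # Tab-separated values, no header row.
    df = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=columns)

    # For example, count how many of the extracted words are palindromes.
    print(df["is_palindrome"].sum())
    ```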

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    !total_1grams (int): The total number of words in the corpus
    !total_volumes (int): The total number of volumes (individual sources) in the corpus
    !total_isograms (int): The total number of isograms found in the corpus (before compacting)
    !total_palindromes (int): How many of the isograms found are palindromes
    !total_tautonyms (int): How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    The SQLite database, on the other hand, combines the data from all four of the plain text files and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram Extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named “ngrams-isograms.csv” and “bnc-isograms.csv” respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the “create-database.sql” script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called “isograms.db”.

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named “statistics.r” that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
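
    To give a flavour of querying the database directly, here is a minimal sketch using Python's built-in sqlite3 module (the table name "ngrams" is a hypothetical placeholder; the column names are those listed in section 1.1):

    ```python
    import sqlite3

    conn = sqlite3.connect("isograms.db")
    cur = conn.cursor()

    # "ngrams" is a placeholder table name; adjust it to the actual table
    # names created by create-database.sql.
    cur.execute(
        "SELECT word, length, count FROM ngrams "
        "WHERE is_palindrome = 1 ORDER BY count DESC LIMIT 10"
    )
    for word, length, count in cur.fetchall():
        print(word, length, count)

    conn.close()
    ```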

  18. polyOne Data Set - 100 million hypothetical polymers including 29 properties

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 24, 2023
    + more versions
    Cite
    Christopher Kuenneth; Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7124187
    Explore at:
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Georgia Institute of Technology
    Authors
    Christopher Kuenneth; Rampi Ramprasad
    Description

    polyOne Data Set

    The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

    Full data set including the properties

    The data files are in Apache Parquet format. The files start with polyOne_*.parquet.

    I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.

    Load the sharded data set with dask:

    ```python
    import dask.dataframe as dd

    ddf = dd.read_parquet("*.parquet", engine="pyarrow")
    ```

    For example, compute the description of the data set:

    ```python
    df_describe = ddf.describe().compute()
    df_describe
    ```
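
    For quick inspection, a single shard can also be read with plain pandas (a minimal sketch; the shard filename below is hypothetical, following the polyOne_*.parquet pattern):

    ```python
    import pandas as pd

    # Read one shard only; the exact filename is an assumption.
    df = pd.read_parquet("polyOne_000.parquet", engine="pyarrow")
    print(df.shape)
    ```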

    
    
    PSMILES strings only

    • generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
    • generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
    
  19. code-search-net-python

    • huggingface.co
    Updated Dec 27, 2023
    Cite
    Fernando Tarin Morales (2023). code-search-net-python [Dataset]. https://huggingface.co/datasets/Nan-Do/code-search-net-python
    Explore at:
    Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 27, 2023
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "code-search-net-python"

      Dataset Description
    

    Homepage: None
    Repository: https://huggingface.co/datasets/Nan-Do/code-search-net-python
    Paper: None
    Leaderboard: None
    Point of Contact: @Nan-Do

      Dataset Summary
    

    This dataset is the Python portion of the CodeSearchNet dataset, annotated with a summary column. The code-search-net dataset includes open source functions that include comments found at GitHub. The summary is a short description of what the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-python.
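
    As a quick start, the dataset can be loaded with the Hugging Face datasets library (a minimal sketch; the split name "train" is an assumption):

    ```python
    from datasets import load_dataset

    # Dataset ID taken from the card above; the split name is an assumption.
    ds = load_dataset("Nan-Do/code-search-net-python", split="train")
    print(ds[0])
    ```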

  20. CoT_Reasoning_Python_General_Query

    • huggingface.co
    Updated Apr 16, 2025
    Cite
    Matt R. Wesney (2025). CoT_Reasoning_Python_General_Query [Dataset]. https://huggingface.co/datasets/moremilk/CoT_Reasoning_Python_General_Query
    Explore at:
    Dataset updated
    Apr 16, 2025
    Authors
    Matt R. Wesney
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CoT Reasoning Python General Query: Enhancing Python Understanding through Chain of Thought Reasoning

    Description: Explore Python programming and general computing queries with the "CoT_Reasoning_Python_General_Query" dataset. This open-source resource (MIT licensed) provides a carefully curated collection of question-and-answer pairs designed to train AI models in understanding and reasoning about a… See the full description on the dataset page: https://huggingface.co/datasets/moremilk/CoT_Reasoning_Python_General_Query.
