89 datasets found
  1. Data Pre-Processing : Data Integration

    • kaggle.com
    zip
    Updated Aug 2, 2022
    Cite
    Mr.Machine (2022). Data Pre-Processing : Data Integration [Dataset]. https://www.kaggle.com/ilayaraja07/data-preprocessing-data-integration
    Explore at:
    Available download formats: zip (2327 bytes)
    Dataset updated
    Aug 2, 2022
    Authors
    Mr.Machine
    Description

    In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv. The student dataset contains columns such as Age, Gender, Grade, and Employed. The marks.csv dataset contains columns such as Mark and City. The Student_id column is common to the two datasets. Follow the steps below to complete this exercise.
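
    A minimal sketch of the merge described above, assuming pandas and the column names listed in the description; the file paths and the inner join are assumptions.

    import pandas as pd

    # Load the two files described above; both share the Student_id column.
    student = pd.read_csv("student.csv")  # Age, Gender, Grade, Employed, Student_id
    marks = pd.read_csv("marks.csv")      # Mark, City, Student_id

    # Merge on the common key; how="inner" keeps only students present in both files.
    merged = pd.merge(student, marks, on="Student_id", how="inner")
    print(merged.head())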

  2. imagenet2012_subset

    • tensorflow.org
    Updated Oct 21, 2024
    + more versions
    Cite
    (2024). imagenet2012_subset [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012_subset
    Explore at:
    Dataset updated
    Oct 21, 2024
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them (80,000+) are nouns. In ImageNet, we aim to provide on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. Upon its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split, export those results to a text file, and upload that file to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to submit up to 2 submissions per week in order to prevent overfitting.

    To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file contains 100,000 lines, one for each image in the test split. Each line of integers corresponds to the rank-ordered, top-5 predictions for that test image. The integers are 1-indexed, referring to the line number in the corresponding labels file (see labels.txt).
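
    As a rough illustration of this format, the hypothetical sketch below writes rank-ordered, 1-indexed top-5 predictions (random placeholders here, not real model outputs) to a text file with one line per test image.

    import numpy as np

    # Placeholder logits standing in for real model outputs over the test split.
    logits = np.random.rand(100_000, 1000)

    # Rank-ordered top-5 predictions per image, converted to 1-indexed labels.
    top5 = np.argsort(-logits, axis=1)[:, :5] + 1

    with open("submission.txt", "w") as f:
        for row in top5:
            f.write(" ".join(str(label) for label in row) + "\n")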

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet2012_subset', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012_subset-1pct-5.0.0.png

  3. Combined wildfire datasets for the United States and certain territories,...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 20, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Combined wildfire datasets for the United States and certain territories, 1800s-Present (combined wildland fire polygons) [Dataset]. https://catalog.data.gov/dataset/combined-wildfire-datasets-for-the-united-states-and-certain-territories-1800s-present-com
    Explore at:
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    United States
    Description

    First, we would like to thank the wildland fire advisory group. Their wisdom and guidance helped us build the dataset as it currently exists.

    Currently, there are multiple, freely available fire datasets that identify wildfire and prescribed fire burned areas across the United States. However, these datasets are all limited in some way. Their time periods may cover only a couple of decades, or they may have stopped collecting data many years ago. Their spatial footprints may be limited to a specific geographic area or agency. Their attribute data may be limited to nothing more than a polygon and a year. None of the existing datasets provides a comprehensive picture of fires that have burned throughout the last few centuries.

    Our dataset uses these existing layers and applies a series of both manual processes and ArcGIS Python (arcpy) scripts to merge them into a single dataset that encompasses the known wildfires and prescribed fires within the United States and certain territories. Forty different fire layers were utilized in this dataset. First, these datasets were ranked by order of observed quality (Tiers). The datasets were given a common set of attribute fields, and as many of these fields as possible were populated within each dataset. All fire layers were then merged together by their common attributes to create a merged dataset (the merged dataset) containing all fire polygons. Polygons were then processed in order of Tier (1-8) so that overlapping polygons in the same year and Tier were dissolved together. Overlapping polygons in subsequent Tiers were removed from the dataset. Attributes from the original datasets of all intersecting polygons in the same year across all Tiers were also merged, so that all attributes from all Tiers were included, but only the polygons from the highest-ranking Tier were dissolved to form the fire polygon. The resulting product (the combined dataset) has only one fire per year in a given area, with one set of attributes.

    While it combines wildfire data from 40 wildfire layers and therefore has more complete information on wildfires than the datasets that went into it, this dataset also has its own set of limitations. Please see the Data Quality attributes within the metadata record for additional information on this dataset's limitations. Overall, we believe this dataset is designed to be a comprehensive collection of fire boundaries within the United States and provides a more thorough and complete picture of fires across the United States when compared to the datasets that went into it.
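
    A highly simplified, hypothetical sketch of the merge-and-dissolve pattern described above; the actual USGS scripts, layer names, and Tier handling are not part of this entry.

    import arcpy

    # Placeholder inputs: fire layers that already share a common attribute schema.
    fire_layers = ["tier1_fires", "tier2_fires", "tier3_fires"]

    # Merge all layers into one feature class, then dissolve overlapping polygons
    # that share the same fire year and Tier.
    arcpy.management.Merge(fire_layers, "merged_fires")
    arcpy.management.Dissolve("merged_fires", "combined_fires", ["Fire_Year", "Tier"])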

  4. #PraCegoVer dataset

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 19, 2023
    Cite
    Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
    Explore at:
    Dataset updated
    Jan 19, 2023
    Dataset provided by
    Institute of Computing, University of Campinas
    Authors
    Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila
    Description

    Automatically describing images using natural sentences is an essential task for visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

    #PraCegoVer arose on the Internet as a movement encouraging social media users to publish images, tag them with #PraCegoVer, and add a short description of their content. Inspired by this movement, we have proposed #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

    #PraCegoVer has 533,523 pairs of images and captions described in Portuguese, collected from more than 14 thousand different profiles. The average caption length in #PraCegoVer is 39.3 words, with a standard deviation of 29.7.

    Dataset Structure

    The #PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json comprises a list of JSON objects with the attributes:

    user: anonymized user that made the post;

    filename: image file name;

    raw_caption: raw caption;

    caption: clean caption;

    date: post date.

    Each instance in dataset.json is associated with exactly one image in the images directory, whose file name is given by the attribute filename. We also provide a sample with five instances, so users can download the sample to get an overview of the dataset before downloading it completely.
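
    A minimal loading sketch based on the structure described above, assuming dataset.json and the extracted images directory sit in the working directory.

    import json
    import os

    with open("dataset.json", encoding="utf-8") as f:
        records = json.load(f)  # objects with user, filename, raw_caption, caption, date

    # Pair each caption with its image file via the filename attribute.
    for item in records[:5]:
        image_path = os.path.join("images", item["filename"])
        print(image_path, "->", item["caption"])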

    Download Instructions

    If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

    cat images.tar.gz.part* > images.tar.gz
    tar -xzvf images.tar.gz

    Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

    python download_dataset.py --access_token=

  5. DNAformer Datasets

    • zenodo.org
    application/gzip, bin +2
    Updated Dec 8, 2024
    Cite
    Omer Sabary; Omer Sabary (2024). DNAformer Datasets [Dataset]. http://doi.org/10.5281/zenodo.13896773
    Explore at:
    Available download formats: zip, application/gzip, txt, bin
    Dataset updated
    Dec 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Omer Sabary; Omer Sabary
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Bar-Lev, D., Orr, I., Sabary, O., Etzion, T., & Yaakobi, E. Scalable and robust DNA-based storage via coding theory and deep learning. 2024.

    Datasets description

    This document provides an overview of the 5 datasets introduced in this work. For each dataset we provide both the raw .fastq files with the sequenced reads and a file that includes the processed, binned reads obtained by the binning step described in the paper.

    The dataset is provided under a similar license to the code repository, which contains scripts for loading and processing the data: https://github.com/itaiorr/Deep-DNA-based-storage.git



    The datasets

    The data was synthesized by Twist Bioscience, and the datasets are differentiated by the sequencing technology used. Two Illumina datasets were generated by Illumina MiSeq. The reads in these two datasets were sequenced with paired-end sequencing, and the merging (stitching) was done with the PEAR software. We include both raw reads and stitched reads in our repository under the names:

    1. Pilot Illumina dataset

      1. BinnedPilotIllumina.txt - includes the pilot dataset in binned format.
      2. P-Pilot_S2_L001_R1_001.fastq.gz - includes the pilot reads pre-stitching as obtained from Illumina MiSeq.
      3. P-Pilot_S2_L001_R2_001.fastq.gz - includes the pilot reads pre-stitching as obtained from Illumina MiSeq.
      4. Pilot_Illumina_raw_reads.fastq - includes the reads post-stitching.

    2. Test Illumina dataset

      1. BinnedTestIllumina.txt - includes the test dataset in binned format.
      2. F1-Full-Pool_S1_L001_R1_001.fastq.gz - includes the Illumina reads pre-stitching as obtained from Illumina MiSeq.
      3. F1-Full-Pool_S1_L001_R2_001.fastq.gz - includes the Illumina reads pre-stitching as obtained from Illumina MiSeq.
      4. test_illumina_raw_reads.fastq - includes the reads post-stitching.

    Three Nanopore datasets, all generated by Oxford Nanopore Technologies MinION under the names:

    1. Pilot Nanopore dataset

      1. BinnedPilotNanopore.txt - reads in binned format.
      2. raw_reads_pilot_nanopore.zip - original basecalled reads as obtained from ONT MinION.
      3. Pilot_RawSignals_1_5.zip , Pilot_RawSignals_6_10.zip , Pilot_RawSignals_11_13.zip - raw nanopore signals as obtained from ONT MinION.
    2. Test Nanopore first flowcell dataset (termed in the paper as “Nanopore single flowcell”).

      1. BinnedNanoporeFirstFlowcell.txt - reads in binned format.
      2. test_pool_nanopore_single.zip - original basecalled reads as obtained from ONT MinION.
      3. NanoporeFirstFlowcellRawSignals.zip - raw nanopore signals as obtained from ONT MinION.
    3. Test Nanopore second flowcells dataset

      1. BinnedNanoporeSecondFlowcell.txt - reads in binned format.
      2. test_nanopore_second_flowcell_part001.zip , test_nanopore_second_flowcell_part002.zip - original basecalled reads as obtained from ONT MinION.
      3. NanoproeSecondFlowcellRawSignals_1_5.zip , NanoproeSecondFlowcellRawSignals_6_10.zip , NanoproeSecondFlowcellRawSignals_11_15.zip - raw nanopore signals as obtained from ONT MinION.

    Additionally, for completeness, we also included a file with the processed and binned reads of the test Nanopore dataset of the combined two flowcells dataset (termed in the paper as “Nanopore two flowcells”). This can be found in the file BinnedNanoporeTwoFlowcells.txt.



    Detailed description

    The binned format was created using the binning step described in the paper. Each cluster of reads appears in the file as a header followed by the reads. More specifically:

    1. The header consists of 2 lines; the first corresponds to the encoded sequence of the cluster, and the second is a line of 18 "*" characters that should be ignored.

    2. The reads in the cluster are provided after the header, with each read given on a separate line.

    3. Each cluster ends with two empty lines. (A rough parsing sketch based on this format is shown below.)
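
    The following hypothetical sketch parses that layout directly; the Parser.py script in the linked repository is the reference implementation.

    def parse_binned(path):
        """Map each encoded sequence to the list of reads in its cluster."""
        clusters = {}
        with open(path) as f:
            # Clusters are separated by two empty lines.
            blocks = f.read().split("\n\n\n")
        for block in blocks:
            lines = [line for line in block.splitlines() if line.strip()]
            if len(lines) < 2:
                continue
            # lines[0] is the encoded sequence; lines[1] is the 18x"*" separator line.
            clusters[lines[0]] = lines[2:]
        return clusters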





    Data processing

    To ease the processing of our datasets, we also provide the following Python scripts (see https://github.com/itaiorr/Deep-DNA-based-storage):

    1. Preprocessor.py - includes our preprocessing procedure for the raw reads. The procedure detects and truncates the primers.

    2. Parser.py - parses the file of binned reads and creates two Python dictionaries. In the first dictionary, each key is an encoded sequence and the value is the list of reads in that cluster. In the second dictionary, each key is a cluster index and the value is the list of reads in that cluster.

  6. Sodium ion polyanionic cathode material dataset

    • data.dtu.dk
    txt
    Updated Jul 11, 2025
    Cite
    Martin Hoffmann Petersen; Juan Maria García Lastra; Arghya Bhowmik; Jin Hyun Chang (2025). Sodium ion polyanionic cathode material dataset [Dataset]. http://doi.org/10.11583/DTU.27202446.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    Technical University of Denmark
    Authors
    Martin Hoffmann Petersen; Juan Maria García Lastra; Arghya Bhowmik; Jin Hyun Chang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We have created a database that includes both static and dynamic structures of four sodium-ion polyanionic cathode materials, NaMPO4 (olivine), NaMPO4 (maricite), Na2MSiO4, and Na2.56M1.72(SO4)3, along with various structures incorporating doping of transition metal ions (M). We consider four different transition metal ions (Fe, Mn, Co, Ni). Sampling was done using structure optimization, ab-initio molecular dynamics, and machine-learning-driven dynamical sampling. The dataset consists of 113,703 structures. For each sampled structure, we record its crystal composition, total energy, atom-wise force vectors, atom-wise magnetic moments, and point charges obtained through Bader analysis. Our polyanionic sodium-ion battery database serves as a valuable addition to existing datasets, enabling the exploration of phase space while providing insights into the dynamic behavior of the materials.

    For the sampling, density functional theory (DFT) calculations were performed using the Vienna Ab initio Simulation Package (VASP) version 6.4. The Perdew-Burke-Ernzerhof (PBE) functional with Hubbard-U corrections was applied for all calculations. The U-values are similar to the ones used for the Materials Project (Fe: 5.3 eV, Mn: 3.9 eV, Co: 3.32 eV, Ni: 6.2 eV). For all calculations, an energy cutoff of 520 eV was applied, with a smearing width of 0.01 eV and convergence criteria set to 1e-5 eV for energy and 0.03 eV/Å for forces. All calculations were performed with spin polarization. The k-points employed for the four materials were fixed, with NaMPO4 (olivine) and NaMPO4 (maricite) utilizing [3,4,6] gamma points, Na2MSiO4 employing [3,4,4] gamma points, and Na2.56M1.72(SO4)3 utilizing [2,3,4] gamma points. When constructing supercells, the gamma point in the direction of cell enlargement was halved.

    The dataset is presented in XYZ format, along with a few Python scripts. It is divided into single-transition-metal-ion structures and multiple-transition-metal-ion structures. This division is provided for each of the four cathode materials: NaMPO4 (olivine), NaMPO4 (maricite), Na2MSiO4, and Na2.56M1.72(SO4)3. For example, Na2.56M1.72(SO4)3 structures are split into single transition metal ion types (Na2M2SO4_alluadite_single.xyz) and multiple transition metal ion types (Na2M2SO4_alluadite_multiple.xyz). The combined dataset, consisting of 113,703 structures, is available in Combined.xyz.

    To extract structural compositions and physical properties, the ase.io.read function from ASE version 3.23.0 is used. An example of how to extract data and plot the physical properties is provided in https://github.com/dtu-energy/cathode-generation-workflow/tree/main/extract_data/read_data.py, and https://github.com/dtu-energy/cathode-generation-workflow/tree/main/extract_data/utils.py contains two functions: one used to attach Bader charges to an ASE Atoms object and another to combine multiple XYZ data files. To cite the data, please use the DOI https://doi.org/10.11583/DTU.27202446
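
    A minimal reading sketch with ASE (as referenced above), assuming the XYZ files are extended-XYZ with energies and forces attached, as the description implies.

    from ase.io import read

    # Read all structures from the combined file (a list of Atoms objects).
    structures = read("Combined.xyz", index=":")

    first = structures[0]
    print(first.get_chemical_formula())
    print(first.get_potential_energy())  # total energy stored with the structure
    print(first.get_forces().shape)      # atom-wise force vectors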

  7. Open University Learning Analytics Dataset

    • kaggle.com
    zip
    Updated Dec 21, 2023
    Cite
    The Devastator (2023). Open University Learning Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/open-university-learning-analytics-dataset
    Explore at:
    Available download formats: zip (44203263 bytes)
    Dataset updated
    Dec 21, 2023
    Authors
    The Devastator
    Description

    Open University Learning Analytics Dataset

    Student Performance and Engagement Data at The Open University

    By UCI [source]

    About this dataset

    This dataset provides an intimate look into student performance and engagement. It grants researchers access to numerous salient metrics of academic performance that illuminate a broad spectrum of student behaviors: how students interact with online learning material, quantitative indicators reflecting their academic outcomes, and demographic data such as age group, gender, and prior education level, among others.

    The main objective of this dataset is to equip analysts and educators with empirical insights underpinning individualized learning experiences - specifically, identifying cases when students may be 'at risk'. Given that preventive early interventions have been shown to significantly reduce the chance of course or program withdrawal among struggling students, accurate predictive measures such as this can steer pedagogical strategies towards being more success-oriented.

    One unique feature of this dataset is its intricate detailing. Not only does it provide overarching summaries on a per-student basis for each presented course, but it also furnishes data related to assessments (scores and submission dates) along with information on individuals' interactions within VLEs (virtual learning environments), spanning different types such as forums and content pages. Such comprehensive collation across multiple contextual layers helps paint an encompassing portrayal of the student experience that can guide better instructional design.

    Due credit must be given when utilizing this database for research purposes. Specifically, citing Kuzilek et al. (2015), 'OU Analyse: Analysing At-Risk Students at The Open University', published in Learning Analytics Review, is required, since the analysis methodologies are grounded in that seminal work.

    It is also important to note that protection of student privacy is paramount under this dataset's terms and conditions. Stringent anonymization techniques have been applied to sensitive variables; while detailed, profiles cannot be traced back to the original respondents.

    How to use the dataset


    • Understanding Your Objectives: Ideal objectives for using this dataset could be to identify at-risk students before they drop out of a class or program, improving course design by analyzing how assignments contribute to final grades, or simply examining relationships between different variables and student performance.

    • Set Up Your Analytical Environment: Before starting any analysis, make sure you have an analytical environment set up where you can load the CSV files included in this dataset. You can use Python notebooks (Jupyter), RStudio, or Tableau-based software if you want visual representation as well.

    • Explore Data Individually: There are seven separate datasets available: Assessments; Courses; Student Assessment; Student Info; VLE (Virtual Learning Environment); Student Registration; and Student VLE. Load these CSVs separately into your environment and do an initial exploration of each one: find out what kind of data they contain (numerical/categorical), whether they have missing values, etc.

    • Merge Datasets: As the core idea is to track a student’s journey through multiple courses over time, combining these datasets will provide insights from wider perspectives. One way could be merging them using common key columns such as 'code_module', 'code_presentation', and 'id_student' (see the sketch after this list). But make sure the merge matches the question you're trying to answer.

    • Identify Key Metrics: Your key metrics will depend on your objectives but might include overall grade averages per course, assessment type, student, region, gender, or age group; the number of clicks in the virtual learning environment; student registration status; and so on.

    • Run Your Analysis: Now you can run queries to analyze the data relevant to your objectives. Try questions like: What factors most strongly predict whether a student will fail an assessment? How does course difficulty or the number of allotments per week change students' scores?

    • Visualization: Visualizing your data can be crucial for understanding patterns and relationships between variables. Use graphs like bar plots, heatmaps, and histograms to represent different aspects of your analyses.

    • Actionable Insights: The final step is interpreting these results in ways that are meaningf...
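
    A minimal merge sketch for the step above, assuming the standard OULAD file names; adjust the tables and join keys to the question you are answering.

    import pandas as pd

    student_info = pd.read_csv("studentInfo.csv")
    student_vle = pd.read_csv("studentVle.csv")

    # Join on the common key columns mentioned above.
    keys = ["code_module", "code_presentation", "id_student"]
    merged = student_info.merge(student_vle, on=keys, how="left")
    print(merged.head())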

  8. Data from: Enhancing Open Modification Searches via a Combined Approach...

    • data.niaid.nih.gov
    Updated Dec 2, 2020
    Cite
    Stefan Schulze; Aime Bienfait Igiraneza; Manuel Kösters; Johannes Leufken; Sebastian A. Leidel; Benjamin A. Garcia; Christian Fufezan; Mechthild Pohlschröder (2020). Enhancing Open Modification Searches via a Combined Approach Facilitated by Ursgal [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4299357
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    University of Bern
    Heidelberg University
    University of Pennsylvania
    Authors
    Stefan Schulze; Aime Bienfait Igiraneza; Manuel Kösters; Johannes Leufken; Sebastian A. Leidel; Benjamin A. Garcia; Christian Fufezan; Mechthild Pohlschröder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The identification of peptide sequences and their post-translational modifications (PTMs) is a crucial step in the analysis of bottom-up proteomics data. The recent development of open modification search (OMS) engines allows virtually all PTMs to be searched for. This not only increases the number of spectra that can be matched to peptides but also greatly advances the understanding of the biological roles of PTMs through the identification, and thereby facilitated quantification, of peptidoforms (peptide sequences and their potential PTMs). While the benefits of combining results from multiple protein database search engines have been established previously, similar approaches for OMS results have been missing so far. Here, we compare and combine results from three different OMS engines, demonstrating an increase in peptide spectrum matches of 8-18%. The unification of search results furthermore allows for the combined downstream processing of search results, including the mapping to potential PTMs. Finally, we test the ability of OMS engines to identify glycosylated peptides. The implementation of these engines in the Python framework Ursgal facilitates the straightforward application of OMS with unified parameters and results files, thereby enabling yet unmatched high-throughput, large-scale data analysis.

    This dataset includes all relevant results files, databases, and scripts that correspond to the accompanying journal article. Specifically, the following files are deposited:

    Homo_sapiens_PXD004452_results.zip: result files from OMS and CS for the dataset PXD004452

    Homo_sapiens_PXD013715_results.zip: result files from OMS and CS for the dataset PXD013715

    Haloferax_volcanii_PXD021874_results.zip: result files from OMS and CS for the dataset PXD021874

    Escherichia_coli_PXD000498_results.zip: result files from OMS and CS for the dataset PXD000498

    databases.zip: target-decoy databases for Homo sapiens, Escherichia coli and Haloferax volcanii as well as a glycan database for Homo sapiens

    scripts.zip: example scripts for all relevant steps of the analysis

    mzml_files.zip: mzML files for all included datasets

    ursgal.zip: current version of Ursgal (0.6.7) that has been used to generate the results (for most recent versions see https://github.com/ursgal/ursgal)

  9. Hydrologic Unit (HUC8) Boundaries for Alaskan Watersheds

    • dataone.org
    • knb.ecoinformatics.org
    • +1more
    Updated Mar 20, 2019
    Cite
    Jared Kibele (2019). Hydrologic Unit (HUC8) Boundaries for Alaskan Watersheds [Dataset]. http://doi.org/10.5063/F14T6GM3
    Explore at:
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Jared Kibele
    Time period covered
    Jan 19, 2018
    Area covered
    Variables measured
    name, source, id_numeric, id_original
    Description

    The United States is divided and sub-divided into successively smaller hydrologic units which are classified into four levels: regions, sub-regions, accounting units, and cataloging units. The hydrologic units are arranged or nested within each other, from the largest geographic area (regions) to the smallest geographic area (cataloging units). Each hydrologic unit is identified by a unique hydrologic unit code (HUC) consisting of two to eight digits based on the four levels of classification in the hydrologic unit system. A shapefile (or geodatabase) of watersheds for the state of Alaska and parts of western Canada was created by merging two datasets: the U.S. Watershed Boundary Dataset (WBD) and the Government of Canada's National Hydro Network (NHN). Since many rivers in Alaska are transboundary, the NHN data is necessary to capture their watersheds. The WBD data can be found at https://catalog.data.gov/dataset/usgs-national-watershed-boundary-dataset-wbd-downloadable-data-collection-national-geospatial- and the NHN data can be found here: https://open.canada.ca/data/en/dataset/a4b190fe-e090-4e6d-881e-b87956c07977. The included Python script was used to subset and merge the two datasets into the single dataset archived here.
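
    A simplified sketch of the subset-and-merge step described above, assuming GeoPandas and placeholder file names; the archived script is the authoritative version.

    import geopandas as gpd
    import pandas as pd

    wbd = gpd.read_file("WBD_HU8.shp")        # U.S. Watershed Boundary Dataset (HUC8)
    nhn = gpd.read_file("NHN_workunits.shp")  # Canadian National Hydro Network

    # Reproject the Canadian layer and append it to the U.S. layer.
    nhn = nhn.to_crs(wbd.crs)
    merged = gpd.GeoDataFrame(pd.concat([wbd, nhn], ignore_index=True), crs=wbd.crs)
    merged.to_file("alaska_huc8_watersheds.shp")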

  10. (HS 2) Automate Workflows using Jupyter notebook to create Large Extent...

    • search.dataone.org
    • hydroshare.org
    Updated Oct 19, 2024
    + more versions
    Cite
    Young-Don Choi (2024). (HS 2) Automate Workflows using Jupyter notebook to create Large Extent Spatial Datasets [Dataset]. http://doi.org/10.4211/hs.a52df87347ef47c388d9633925cde9ad
    Explore at:
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    Hydroshare
    Authors
    Young-Don Choi
    Description

    We implemented automated workflows using Jupyter notebooks for each state. The GIS processing, crucial for merging, extracting, and projecting GeoTIFF data, was performed using ArcPy, a Python package for geographic data analysis, conversion, and management within ArcGIS (Toms, 2015). After generating state-scale LES (large extent spatial) datasets in GeoTIFF format, we utilized the xarray and rioxarray Python packages to convert GeoTIFF to NetCDF. Xarray is a Python package for working with multi-dimensional arrays, and rioxarray is the rasterio extension for xarray; rasterio is a Python library for reading and writing GeoTIFF and other raster formats. Xarray facilitated data manipulation and metadata addition in the NetCDF file, while rioxarray was used to save the GeoTIFF as NetCDF. These procedures resulted in the creation of three HydroShare resources (HS 3, HS 4 and HS 5) for sharing state-scale LES datasets. Notably, due to licensing constraints with ArcGIS Pro, a commercial GIS software, the Jupyter notebook development was undertaken on a Windows OS.
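
    A minimal sketch of the GeoTIFF-to-NetCDF conversion described above; the file names and attributes are placeholders, not the resource's actual outputs.

    import rioxarray

    # Open a state-scale GeoTIFF as an xarray DataArray (rasterio under the hood).
    les = rioxarray.open_rasterio("state_les.tif").rename("les")
    les.attrs["description"] = "State-scale LES dataset"

    # Save to NetCDF via xarray's writer.
    les.to_netcdf("state_les.nc")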

  11. VanRossum-Alpaca

    • huggingface.co
    Updated Nov 15, 2024
    + more versions
    Cite
    Rasmus Rasmussen (2024). VanRossum-Alpaca [Dataset]. https://huggingface.co/datasets/theprint/VanRossum-Alpaca
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 15, 2024
    Authors
    Rasmus Rasmussen
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Homage to Python

    The VanRossum dataset is all Python! I used DataMix to combine a handful of highly rated Python-centric datasets, to get a sampling of each and create something new. This dataset has 80,000 entries and is named after Guido van Rossum, the man who invented Python back in 1991. See the VanRossum Collection on HF for all things related to this dataset.

      Alpaca / GPT
    

    There are 2 versions of this dataset available on Huggingface.

    VanRossum-GPT… See the full description on the dataset page: https://huggingface.co/datasets/theprint/VanRossum-Alpaca.

  12. Connecticut State Parcel Layer 2023

    • data.ct.gov
    • s.cnmilf.com
    • +3more
    csv, xlsx, xml
    Updated Jan 29, 2025
    + more versions
    Cite
    Office of Policy and Management (2025). Connecticut State Parcel Layer 2023 [Dataset]. https://data.ct.gov/Environment-and-Natural-Resources/Connecticut-State-Parcel-Layer-2023/v875-mr5r/data
    Explore at:
    Available download formats: xml, csv, xlsx
    Dataset updated
    Jan 29, 2025
    Dataset authored and provided by
    Office of Policy and Management
    Area covered
    Connecticut
    Description

    The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2023 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report), to its respective regional council of governments (COG) by May 1 annually.

    These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 12/08/2023 from data collected in 2022-2023. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.

    CAMA Notes:

    The CAMA underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.
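
    A hypothetical sketch of the concatenation step described above; the actual scripts and column names are not published with this entry.

    import pandas as pd

    cama = pd.read_csv("town_cama.csv", dtype=str)

    # Build a unique identifier per entry by concatenating fields,
    # e.g. a census town code plus the town's local parcel identifier.
    cama["Link"] = cama["Town_Code"] + cama["Parcel_ID"]
    print(cama["Link"].is_unique)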

    • CAMA was provided by the towns.

    • Canaan parcels are viewable, but no additional information is available since no CAMA data was submitted.

    Spatial Data Notes:

    Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,247,506 parcels.

    • No alteration has been made to the spatial geometry of the data.

    • Fields that are associated with CAMA data were provided by towns.

    • The data fields that have information from the CAMA were sourced from the towns’ CAMA data.

    • If the town did not provide a field for linking the parcels back to the CAMA, a new field within the original data was selected, provided it joined back to the CAMA with a match rate above 50%.

    • Linking fields were renamed to "Link".

    • All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.

    • Any field that was not town name, Location, Editor, Edit Date, or a field associated back to the CAMA, was not used in the creation of this Dataset.

    • Only the fields related to town name, location, editor, edit date, and link fields associated with the towns’ CAMA were included in the creation of this dataset. Any other field provided in the original data was deleted or not used.

    • Field names for town (Muni, Municipality) were renamed to "Town Name".

    The attributes included in the data:

    • Town Name

    • Owner

    • Co-Owner

    • Link

    • Editor

    • Edit Date

    • Collection year – year the parcels were submitted

    • Location

    • Mailing Address

    • Mailing City

    • Mailing State

    • Assessed Total

    • Assessed Land

    • Assessed Building

    • Pre-Year Assessed Total

    • Appraised Land

    • Appraised Building

    • Appraised Outbuilding

    • Condition

    • Model

    • Valuation

    • Zone

    • State Use

    • State Use Description

    • Living Area

    • Effective Area

    • Total rooms

    • Number of bedrooms

    • Number of Baths

    • Number of Half-Baths

    • Sale Price

    • Sale Date

    • Qualified

    • Occupancy

    • Prior Sale Price

    • Prior Sale Date

    • Prior Book and Page

    • Planning Region

    *Please note that not all parcels have a link to a CAMA entry.

    *If any discrepancies are discovered within the data, whether pertaining to geographical inaccuracies or attribute inaccuracy, please directly contact the respective municipalities to request any necessary amendments

    As of 2/15/2023 - Occupancy, State Use, State Use Description, and Mailing State added to dataset

    Additional information about the specifics of data availability and compliance will be coming soon.

  13. CLM AWRA HRVs Uncertainty Analysis

    • researchdata.edu.au
    • data.gov.au
    • +1more
    Updated Jul 10, 2017
    Cite
    Bioregional Assessment Program (2017). CLM AWRA HRVs Uncertainty Analysis [Dataset]. https://researchdata.edu.au/clm-awra-hrvs-uncertainty-analysis/2984398
    Explore at:
    Dataset updated
    Jul 10, 2017
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Bioregional Assessment Program
    Description

    Abstract

    This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    This dataset contains the data and scripts to generate the hydrological response variables for surface water in the Clarence Moreton subregion as reported in CLM 2.6.1 (Gilfedder et al. 2016).

    Dataset History

    File CLM_AWRA_HRVs_flowchart.png shows the different files in this dataset and how they interact. The Python and R scripts were written by the BA modelling team to read, combine, and analyse, as detailed below, the source datasets CLM AWRA model, CLM groundwater model V1, and CLM16swg Surface water gauging station data within the Clarence Moreton Basin, in order to create the hydrological response variables for surface water as reported in CLM 2.6.1 (Gilfedder et al. 2016).

    R-script HRV_SWGW_CLM.R reads, for each model simulation, the outputs from the surface water model in netcdf format from file Qtot.nc (dataset CLM AWRA model) and the outputs from the groundwater model, flux_change.csv (dataset CLM groundwater model V1) and creates a set of files in subfolder /Output for each GaugeNr and simulation Year:

    CLM_GaugeNr_Year_all.csv and CLM_GaugeNR_Year_baseline.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for baseline conditions

    CLM_GaugeNr_Year_CRDP.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for CRDP conditions (=AWRA streamflow - MODFLOW change in SW-GW flux)

    CLM_GaugeNr_Year_minMax.csv: minimum and maximum of HRVs over all 5000 simulations

    Python script CLM_collate_DoE_Predictions.py collates that information into the following files, for each HRV and each maxtype (absolute maximum (amax), relative maximum (pmax), and time of absolute maximum change (tmax)):

    CLM_AWRA_HRV_maxtyp_DoE_Predictions: for each simulation and each gauge_nr, the maxtyp of the HRV over the prediction period (2012 to 2102)

    CLM_AWRA_HRV_DoE_Observations: for each simulation and each gauge_nr, the HRV for the years that observations are available

    CLM_AWRA_HRV_Observations: summary statistics of each HRV and the observed value (based on data set CLM16swg Surface water gauging station data within the Clarence Moreton Basin)

    CLM_AWRA_HRV_maxtyp_Predictions: summary statistics of each HRV

    R-script CLM_CreateObjectiveFunction.R calculates for each HRV the objective function value for all simulations and stores it in CLM_AWRA_HRV_ss.csv. This file is used by python script CLM_AWRA_SI.py to generate figure CLM-2615-002-SI.png (sensitivity indices).

    The AWRA objective function is combined with the overall objective function from the groundwater model in dataset CLM Modflow Uncertainty Analysis (CLM_MF_DoE_ObjFun.csv) into csv file CLM_AWRA_HRV_oo.csv. This file is used to select behavioural simulations in python script CLM-2615-001-top10.py. This script uses files CLM_NodeOrder.csv and BA_Visualisation.py to create the figures CLM-2616-001-HRV_10pct.png.

    Dataset Citation

    Bioregional Assessment Programme (2016) CLM AWRA HRVs Uncertainty Analysis. Bioregional Assessment Derived Dataset. Viewed 28 September 2017, http://data.bioregionalassessments.gov.au/dataset/e51a513d-fde7-44ba-830c-07563a7b2402.

    Dataset Ancestors

  14. Jenet Austin - Python workflow for seamlessly merging high resolution...

    • gimi9.com
    Updated Oct 1, 2025
    Cite
    (2025). Jenet Austin - Python workflow for seamlessly merging high resolution digital elevation models | gimi9.com [Dataset]. https://gimi9.com/dataset/au_fedora-pid_csiro-65029/
    Explore at:
    Dataset updated
    Oct 1, 2025
    Description

    This set of Python scripts and Jupyter notebooks constitutes a workflow for seamlessly merging multiple digital elevation models (DEMs) to produce a hydrologically robust high-resolution DEM for large river basins. The DEM merging method is adapted from Gallant, J.C. (2019) Merging lidar with coarser DEMs for hydrodynamic modelling over large areas, in: El Sawah, S. (Ed.), MODSIM2019, 23rd International Congress on Modelling and Simulation, Modelling and Simulation Society of Australia and New Zealand. https://mssanz.org.au/modsim2019/K24/gallant.pdf

    The workflow runs on the CSIRO EASI platform (https://research.csiro.au/easi/) and expects data stored in an AWS S3 bucket. Dask is used for parallel processing. The workflow was built to merge all the available high-resolution DEMs for the Murray Darling Basin, Australia, using 852 individual lidar and photogrammetry DEMs from the Geoscience Australia elevation data portal Elvis (https://elevation.fsdf.org.au/) and the Forests and Buildings removed DEM (FABDEM; Hawker et al. 2022, https://doi.org/10.1088/1748-9326/ac4d4f), a bare-earth, radar-derived, 1 arc-second resolution global elevation model. The seamless composite high-resolution Murray Darling Basin DEM datasets (5 m and 25 m resolutions) produced with this workflow can be downloaded here: https://doi.org/10.25919/e1z5-mx88.

    The workflow is divided into three parts: 1) preprocessing, 2) DEM merging, and 3) postprocessing and validation. The Jupyter notebooks in the workflow are also provided in HTML format for initial access to the content without needing a Python kernel.

  15. Research data underpinning "Investigating Reinforcement Learning Approaches...

    • data.ncl.ac.uk
    application/csv
    Updated Aug 13, 2024
    Cite
    Zheng Luo (2024). Research data underpinning "Investigating Reinforcement Learning Approaches In Stock Market Trading" [Dataset]. http://doi.org/10.25405/data.ncl.26539735.v1
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Aug 13, 2024
    Dataset provided by
    Newcastle University
    Authors
    Zheng Luo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The final dataset utilised for the publication "Investigating Reinforcement Learning Approaches In Stock Market Trading" was processed by downloading and combining data from multiple reputable sources to suit the specific needs of this project. Raw data were retrieved using a Python finance API. Afterwards, Python and NumPy were used to combine and normalise the data to create the final dataset.

    The raw data was sourced as follows:

    • Stock prices of NVIDIA & AMD, financial indexes, and commodity prices: retrieved from Yahoo Finance.

    • Economic indicators: collected from the US Federal Reserve.

    The dataset was normalised to minute intervals, and the stock prices were adjusted to account for stock splits.

    This dataset was used for exploring the application of reinforcement learning in stock market trading. After creating the dataset, it was used in a reinforcement learning environment to train several reinforcement learning algorithms, including deep Q-learning, policy networks, policy networks with baselines, actor-critic methods, and time series incorporation. The performance of these algorithms was then compared based on profit made and other financial evaluation metrics, to investigate the application of reinforcement learning algorithms in stock market trading.

    The attached 'README.txt' contains methodological information and a glossary of all the variables in the .csv file.
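
    A rough sketch of the retrieval-and-normalisation approach described above, assuming the yfinance package; the tickers, interval, and normalisation used for the publication may differ.

    import yfinance as yf

    # Minute-interval close prices for the two stocks mentioned above.
    prices = yf.download(["NVDA", "AMD"], period="5d", interval="1m")["Close"]

    # Simple z-score normalisation of minute returns.
    returns = prices.pct_change().dropna()
    normalised = (returns - returns.mean()) / returns.std()
    print(normalised.head())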

  16. CIFAR-10 Python

    • kaggle.com
    zip
    Updated Jan 27, 2018
    Cite
    Kris (2018). CIFAR-10 Python [Dataset]. https://www.kaggle.com/datasets/pankrzysiu/cifar10-python/code
    Explore at:
    Available download formats: zip (340613496 bytes)
    Dataset updated
    Jan 27, 2018
    Authors
    Kris
    Description

    Context

    CIFAR-10 is an excellent dataset for many image processing experiments.

    Content

    Usage instructions

    in Keras

    from os import listdir, makedirs
    from os.path import join, exists, expanduser
    
    cache_dir = expanduser(join('~', '.keras'))
    if not exists(cache_dir):
      makedirs(cache_dir)
    datasets_dir = join(cache_dir, 'datasets') # /cifar-10-batches-py
    if not exists(datasets_dir):
      makedirs(datasets_dir)
    
    # If you have multiple input datasets, change the below cp command accordingly, typically:
    # !cp ../input/cifar10-python/cifar-10-python.tar.gz ~/.keras/datasets/
    !cp ../input/cifar-10-python.tar.gz ~/.keras/datasets/
    !ln -s ~/.keras/datasets/cifar-10-python.tar.gz ~/.keras/datasets/cifar-10-batches-py.tar.gz
    !tar xzvf ~/.keras/datasets/cifar-10-python.tar.gz -C ~/.keras/datasets/
    

    general Python 3

    # Note: the returned dict uses bytes keys (e.g. b'data', b'labels').
    def unpickle(file):
      import pickle
      with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
      return dict
    
    !tar xzvf ../input/cifar-10-python.tar.gz
    

    then see section "Dataset layout" in https://www.cs.toronto.edu/~kriz/cifar.html for details
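
    As a small follow-up (assuming the batches have been extracted as above), one batch can be loaded with the unpickle() helper and reshaped into 32x32 RGB images per the layout described on that page:

    import numpy as np

    batch = unpickle("cifar-10-batches-py/data_batch_1")
    # Rows are 3072 bytes in channel-major order; reshape to NHWC images.
    images = batch[b"data"].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    labels = np.array(batch[b"labels"])
    print(images.shape, labels[:10])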

    Acknowledgements

    Downloaded directly from here:

    https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

    See description: https://www.cs.toronto.edu/~kriz/cifar.html

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  17. Data from: Cooperative Robotic Exploration of a Planetary Skylight Surface...

    • portaldelainvestigacion.uma.es
    • data-staging.niaid.nih.gov
    • +1more
    Updated 2025
    Cite
    Domínguez García-Escudero, Raúl; Germa, Thierry; Pérez del Pulgar Mancebo, Carlos Jesús; Domínguez García-Escudero, Raúl; Germa, Thierry; Pérez del Pulgar Mancebo, Carlos Jesús (2025). Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave - Datasets [Dataset]. https://portaldelainvestigacion.uma.es/documentos/67a9c7ce19544708f8c7316e?lang=ca
    Explore at:
    Dataset updated
    2025
    Authors
    Domínguez García-Escudero, Raúl; Germa, Thierry; Pérez del Pulgar Mancebo, Carlos Jesús; Domínguez García-Escudero, Raúl; Germa, Thierry; Pérez del Pulgar Mancebo, Carlos Jesús
    Description

    The dataset contains the logs used to produce the results described in the publication Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave, Raúl Domínguez et al., 2025.

    Cooperative Surface Exploration

    • CoRob_MP1_results.xlsx: Includes the log produced at the commanding station during the Mission Phase 1. It has been used to produce the results evaluation of the MP1.

    • cmap.ply: Resulting map of the MP1.

    • ground_truth_transformed_and_downsampled.ply: Ground truth map used for the evaluation of the cooperative map accuracy.

    Ground Truth Rover Logs

    The dataset contains the samples used to generate the map provided as ground truth for the cave in the publication Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave, Raúl Domínguez et al., 2025.

    The dataset has three parts. Between each of the parts, the data capture had to be interrupted. After each interruption, the position of the rover is not exactly the same as before the interruption. For that reason, it has been quite challenging to generate a full reconstruction using the three parts one after the other. In fact, because it was not possible to combine the different parts in a single SLAM reconstruction, the last log was not filtered or even pre-processed.

    Each log contains:

    • depthmaps: the raw LiDAR data from the Velodyne 32. Format: tiff.

    • filtered_cloud: the pre-processed LiDAR data from the Velodyne 32. Format: ply.

    • joint_states: the motor position values. Unfortunately the back axis passive joint is not included. Format: json.

    • orientation_samples: the orientation as provided by the IMU sensor. Format: json.

    • asguard_v4.urdf: In addition to the datasets, a geometrical robot model is provided which might be needed for environment reconstruction and pose estimation algorithms. Format: URDF.

    Folders contents

    ├── 20211117-1112
    │   ├── depth
    │   │   └── depth_1637143958347198
    │   ├── filtered_cloud
    │   │   └── cloud_1637143958347198
    │   ├── joint_states
    │   │   └── joints_state_1637143957824829
    │   └── orientation_samples
    │       └── orientation_sample_1637143958005814
    ├── 20211117-1140
    │   ├── depth
    │   │   └── depth_1637145649108790
    │   ├── filtered_cloud
    │   │   └── cloud_1637145649108790
    │   ├── joint_states
    │   │   └── joints_state_1637145648630977
    │   └── orientation_samples
    │       └── orientation_sample_1637145648831795
    └── 20211117-1205
        ├── depth
        │   └── depth_1637147164030135
        ├── filtered_cloud
        │   └── cloud_1637147164330388
        ├── joint_states
        │   └── joints_state_1637147163501574
        └── orientation_samples
            └── orientation_sample_1637147163655187

    Cave reconstruction

    • first_log_2cm_res_pointcloud-20231222.ply: contains the integrated pointcloud produced from the first of the logs.

    Coyote 3 Logs

    The msgpack datasets can be imported using Python with the pocolog2msgpack library.
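
    A generic inspection sketch, assuming the standard msgpack Python package and a placeholder file name; the exact structure of the exported logs depends on pocolog2msgpack.

    import msgpack

    with open("coyote3_odometry.msg", "rb") as f:
        data = msgpack.unpack(f, raw=False)

    # Print the top-level keys (or the object itself) to see what the log contains.
    print(list(data) if isinstance(data, dict) else data)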

    The geometrical rover model of Coyote 3 is included in URDF format. It can be used in environment reconstruction algorithms which require the positions of the different sensors.

    MP3

    Includes exports of the log files used to compute the KPIs of the MP3.

    MP4

    These logs were used to obtain the KPI values for the MP4. It is composed of the following archives:

    • log_coyote_02-03-2023_13-22_01-exp3.zip
    • log_coyote_02-03-2023_13-22_01-exp4.zip
    • log_coyote_02-09-2023_19-14_18_demo_skylight.zip
    • log_coyote_02-09-2023_19-14_20_demo_teleop.zip
    • coyote3_odometry_20230209-154158.0003_msgpacks.tar.gz
    • coyote3_odometry_20230203-125251.0819_msgpacks.tar.gz

    Cave PLYs

    Two integrated pointclouds and one trajectory produced from logs captured by Coyote 3 inside the cave:

    • Skylight_subsampled_mesh.ply
    • teleop_tunnel_pointcloud.ply
    • traj.ply

    Example scripts to load the datasets

    The repository https://github.com/Rauldg/corobx_dataset_scripts contains some example scripts which load some of the datasets.

  18. Connecticut CAMA and Parcel Layer

    • geodata.ct.gov
    • data.ct.gov
    • +1more
    Updated Nov 20, 2024
    + more versions
    Cite
    State of Connecticut (2024). Connecticut CAMA and Parcel Layer [Dataset]. https://geodata.ct.gov/datasets/ctmaps::connecticut-cama-and-parcel-layer
    Explore at:
    Dataset updated
    Nov 20, 2024
    Dataset authored and provided by
    State of Connecticut
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Description

    Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. This dataset combines the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2025 into a single geospatial dataset so that stakeholders and the GIS community can access the information more easily. It includes geometries for all 169 municipalities and CAMA attribution for all but one municipality. The data were gathered from the Connecticut municipalities by the COGs and then submitted to CT OPM. The dataset was created in September 2025 from data collected in 2024-2025 and was processed with Python scripts and ArcGIS Pro for standardization and integration. To learn more about parcels and CAMA in Connecticut, visit the Parcels Page in the Geodata Portal.

    Coordinate system: The dataset is provided in the NAD 83 Connecticut State Plane (2011) projection (EPSG 2234), as it was for 2024. Prior versions were provided in WGS 1984 Web Mercator Auxiliary Sphere (EPSG 3857).

    Ownership suppression: The dataset includes parcel data for all towns across the state, and some towns have fully suppressed ownership information. In these cases the owner’s name is replaced with the label "Current Owner," the co-owner’s name is listed as "Current Co-Owner," and the mailing address appears as the property address itself. For towns with fully suppressed ownership data, note that no "Suppression" field was included in the submission to confirm these details; this labeling approach was adopted instead.

    New data fields: The dataset introduces the "Property Zip" and "Mailing Zip" fields, which hold the zip codes for the property and the owner’s mailing address.

    Service URL: In 2024, a stable URL was implemented to maintain public access to the most up-to-date data layer. Users are strongly encouraged to transition to the new service as soon as possible to ensure uninterrupted workflows. The URL will remain persistent, providing long-term stability for applications and integrations; once you have transitioned to the new service, no further URL changes will be necessary.

    CAMA notes: The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting table contains 1,354,720 entries with property assessments and other relevant attributes. The CAMA data were provided by the towns.

    Spatial data notes: Parcels from the different municipalities were merged using ArcGIS Pro and Python; the resulting layer contains 1,282,833 parcels, and no alteration was made to the spatial geometry of the data. Fields carrying CAMA information were sourced from the towns’ CAMA submissions. If a town provided no field for linking parcels back to the CAMA, another field from the original data was selected when it joined back to the CAMA with a match rate above 50%. Linking fields were renamed "Link," and a census town code was prepended to each linking value to create a unique identifier per town. Only the fields for town name, location, editor, edit date, and the link back to the towns’ CAMA were retained in the creation of this dataset; any other field in the original data was deleted or not used. Town-name fields (Muni, Municipality) were renamed "Town Name."

    Attributes included in the data: Town Name, Owner, Co-Owner, Link, Editor, Edit Date, Collection Year (the year the parcels were submitted), Location, Property Zip, Mailing Address, Mailing City, Mailing State, Mailing Zip, Assessed Total, Assessed Land, Assessed Building, Pre-Year Assessed Total, Appraised Land, Appraised Building, Appraised Outbuilding, Condition, Model, Valuation, Zone, State Use, State Use Description, Land Acre, Living Area, Effective Area, Total Rooms, Number of Bedrooms, Number of Baths, Number of Half-Baths, Sale Price, Sale Date, Qualified, Occupancy, Prior Sale Price, Prior Sale Date, Prior Book and Page, Planning Region, FIPS Code.

    Please note that not all parcels have a link to a CAMA entry. If any discrepancies are discovered within the data, whether geographical or attribute inaccuracies, please contact the respective municipality directly to request any necessary amendments. Additional information about the specifics of data availability and compliance will be coming soon. If you need a WFS service for use in specific applications: Please Click Here. Contact: opm.giso@ct.gov
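    To make the linking step in the spatial data notes concrete, the snippet below is a minimal pandas sketch of how a census town code might be prepended to a town-provided linking field and used to join parcel records to CAMA records. The column names (town_code, link_raw), sample values, and table layout are hypothetical illustrations, not the actual schema or scripts used by CT OPM.

    import pandas as pd

    # Hypothetical inputs: a parcel table and a CAMA table.
    # Column names and values are assumptions for illustration only.
    parcels = pd.DataFrame({
        "town_code": ["001", "001", "002"],            # census town code (assumed)
        "link_raw": ["0045-12", "0045-13", "17-003"],  # town-provided linking field
        "Town Name": ["Andover", "Andover", "Ansonia"],
    })
    cama = pd.DataFrame({
        "town_code": ["001", "002"],
        "link_raw": ["0045-12", "17-003"],
        "Assessed Total": [215000, 189500],
    })

    # Rename the linking value to "Link" and prepend the census town code
    # so the identifier is unique statewide, as the description notes.
    for df in (parcels, cama):
        df["Link"] = df["town_code"] + df["link_raw"]

    # Left-join CAMA attributes onto the parcels; parcels with no CAMA match
    # keep empty attributes, mirroring "not all parcels have a link to a CAMA entry".
    merged = parcels.merge(cama[["Link", "Assessed Total"]], on="Link", how="left")
    print(merged)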

  19. T

    covid19

    • tensorflow.org
    Updated Dec 6, 2022
    + more versions
    Cite
    (2022). covid19 [Dataset]. https://www.tensorflow.org/datasets/catalog/covid19
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    This repository attempts to assemble the largest Covid-19 epidemiological database in addition to a powerful set of expansive covariates. It includes open, publicly sourced, licensed data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, weather, and more.

    This particular dataset corresponds to a join of all the different tables that are part of the repository. Therefore, expect the resulting samples to be highly sparse.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('covid19', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.
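    Because this split is a join of every table in the repository, it can help to inspect the feature dictionary and split size before iterating over examples. The following is a minimal sketch using the standard tensorflow_datasets metadata options:

    import tensorflow_datasets as tfds

    # Load the split together with its metadata so the (large, sparse)
    # feature dictionary can be inspected before touching any examples.
    ds, info = tfds.load('covid19', split='train', with_info=True)
    print(info.features)                      # all joined columns and their dtypes
    print(info.splits['train'].num_examples)  # number of rows in the train split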

  20. d

    Replication Data for Exploring an extinct society through the lens of...

    • dataone.org
    Updated Dec 16, 2023
    Cite
    Wieczorek, Oliver; Malzahn, Melanie (2023). Replication Data for Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus [Dataset]. http://doi.org/10.7910/DVN/UF8DHK
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wieczorek, Oliver; Malzahn, Melanie
    Description

    The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". The study used the CEToM corpus (https://cetom.univie.ac.at/) of Tocharian texts to analyze the life-world of the elites of an extinct society situated in modern eastern China. To acquire the raw data needed for steps 1 and 2, please contact Melanie Malzahn (melanie.malzahn@univie.ac.at). We conducted a mixed-methods study consisting of close reading, content analysis, and multiple correspondence analysis (MCA). The Excel file titled "fragments_architecture_combined.xlsx" allows for replication of the MCA and corresponds to the third step of the workflow outlined below.

    Data preparation and merging were done in Python (version 3.9.10) with the packages pandas (version 1.5.3), os (version 3.12.0), re (version 3.12.0), numpy (version 1.24.3), gensim (version 4.3.1), BeautifulSoup4 (version 4.12.2), pyasn1 (version 0.4.8), and langdetect (version 1.0.9). The multiple correspondence analyses were conducted in R (version 4.3.2) with the packages FactoMineR (version 2.9), factoextra (version 1.0.7), readxl (version 1.4.3), tidyverse (version 2.0.0), ggplot2 (version 3.4.4), and psych (version 2.3.9).

    After requesting the necessary files, open and execute the scripts in the order outlined below to replicate the analysis:

    Preparatory step: Create a folder for the Python and R scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script generates the following folders:
    • "tarim-brahmi_database" – contains the Tocharian dictionaries and Tocharian text fragments.
    • "dictionaries" – contains Tocharian A and Tocharian B vocabularies, including linguistic features such as translations, meanings, and part-of-speech tags. A full overview of the words is provided at https://cetom.univie.ac.at/?words.
    • "fragments" – contains the Tocharian text fragments as XML files.
    • "word_corpus_data" – will contain Excel files of the corpus data after the first step.
    • "Architectural_terms" – contains the data on the architectural terms used in the dataset (e.g. dwelling, house).
    • "regional_data" – contains the data on the findspots (Tocharian and modern Chinese equivalents, e.g. Duldur-Akhur and Kucha).
    • "mca_ready_data" – the folder in which the Excel file with the merged data will be saved. Note that the prepared file "fragments_architecture_combined.xlsx" can be saved into this directory, which allows you to skip steps 1 and 2 and reproduce the MCA of the content analysis directly from the third step of our workflow (R script 3_conduct_MCA.R).

    First step – run 1_read_xml-files.py: loops over the XML files in the dictionaries folder and identifies word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. It then loops over the XML text files and extracts a text ID number, language (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, the material on which the text is written, medium, findspot, the source text in Tocharian, and the translation where available. After successful feature extraction, the resulting pandas DataFrame is exported to the word_corpus_data folder (a hedged sketch of this extraction step appears after the workflow).

    Second step – run 2_merge_excel_files.py: merges all Excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was originally based on close reading.

    Third step – run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA, produces the descriptive values, then conducts the MCA, identifies typical texts per dimension, and exports the PNG files uploaded to this repository.
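    The repository's own scripts are needed for a faithful replication, but the extraction performed by 1_read_xml-files.py can be pictured with a short, hypothetical sketch. The XML tag names, folder paths, and output file name below are placeholders chosen for illustration and do not reflect the actual CEToM markup.

    from pathlib import Path

    import pandas as pd
    from bs4 import BeautifulSoup  # BeautifulSoup4; the "xml" parser also requires lxml


    def extract_fragment(xml_path: Path) -> dict:
        """Pull a few illustrative fields out of one text-fragment XML file.

        The tag names below are hypothetical placeholders, not the CEToM schema.
        """
        soup = BeautifulSoup(xml_path.read_text(encoding="utf-8"), "xml")

        def text_of(tag_name: str):
            tag = soup.find(tag_name)
            return tag.get_text(strip=True) if tag else None

        return {
            "text_id": text_of("id"),
            "language": text_of("language"),   # Tocharian A or B
            "title": text_of("title"),
            "genre": text_of("genre"),
            "findspot": text_of("findspot"),
            "source_text": text_of("transcription"),
        }


    # Loop over all fragment files and collect one row per text, mirroring the
    # description of 1_read_xml-files.py above (writing Excel requires openpyxl).
    fragment_dir = Path("tarim-brahmi_database/fragments")
    rows = [extract_fragment(p) for p in fragment_dir.glob("*.xml")]
    pd.DataFrame(rows).to_excel("word_corpus_data/fragments.xlsx", index=False)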
