43 datasets found
  1. Data Pre-Processing : Data Integration

    • kaggle.com
    Updated Aug 2, 2022
    Cite
    Mr.Machine (2022). Data Pre-Processing : Data Integration [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-preprocessing-data-integration
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 2, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mr.Machine
    Description

    In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv. The student.csv dataset contains columns such as Age, Gender, Grade, and Employed, while the marks.csv dataset contains columns such as Mark and City. The Student_id column is common between the two datasets. Follow these steps to complete this exercise.
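
    As a rough sketch of the merge described above (assuming both files sit in the working directory and share the Student_id column exactly as named), the exercise boils down to a single pandas join:

    import pandas as pd

    # Load both source files
    students = pd.read_csv("student.csv")
    marks = pd.read_csv("marks.csv")

    # Join on the shared Student_id column
    combined = students.merge(marks, on="Student_id", how="inner")
    print(combined.head())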

  2. imagenet2012_subset

    • tensorflow.org
    Updated Oct 21, 2024
    + more versions
    Cite
    (2024). imagenet2012_subset [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012_subset
    Explore at:
    Dataset updated
    Oct 21, 2024
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. Upon its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split, export those results to a text file, and upload that file to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to submit up to 2 submissions per week in order to prevent overfitting.

    To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file contains 100,000 lines, one for each image in the test split. Each line holds the rank-ordered top-5 predictions for that test image, as integers that are 1-indexed against the line numbers of the corresponding labels file (see labels.txt).
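
    As an illustration of that format (not part of the dataset), the sketch below writes a submission file from an array of per-class scores; the array itself, its size and the 0-based-to-1-indexed shift are assumptions:

    import numpy as np

    # Placeholder scores; the real split has 100,000 rows, one per test image
    probs = np.random.rand(100, 1000)

    # Rank-ordered top-5 class indices per image, highest score first
    top5 = np.argsort(-probs, axis=1)[:, :5]

    # Shift 0-based indices to the 1-indexed lines of the labels file
    with open("submission.txt", "w") as f:
        for row in top5 + 1:
            f.write(" ".join(str(c) for c in row) + "\n")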

    To use this dataset:

    import tensorflow_datasets as tfds
    
    # Load the training split and preview a few examples
    ds = tfds.load('imagenet2012_subset', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012_subset-1pct-5.0.0.png

  3. Combined wildfire datasets for the United States and certain territories,...

    • s.cnmilf.com
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Combined wildfire datasets for the United States and certain territories, 1800s-Present (combined wildland fire polygons) [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/combined-wildfire-datasets-for-the-united-states-and-certain-territories-1800s-present-com
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    United States
    Description

    First, we would like to thank the wildland fire advisory group. Their wisdom and guidance helped us build the dataset as it currently exists. Currently, there are multiple, freely available fire datasets that identify wildfire and prescribed fire burned areas across the United States. However, these datasets are all limited in some way. Their time periods may cover only a couple of decades, or they may have stopped collecting data many years ago. Their spatial footprints may be limited to a specific geographic area or agency. Their attribute data may be limited to nothing more than a polygon and a year. None of the existing datasets provides a comprehensive picture of fires that have burned throughout the last few centuries.

    Our dataset uses these existing layers and a series of both manual processes and ArcGIS Python (arcpy) scripts to merge them into a single dataset that encompasses the known wildfires and prescribed fires within the United States and certain territories. Forty different fire layers were utilized in this dataset. First, these datasets were ranked by order of observed quality (Tiers). The datasets were given a common set of attribute fields, and as many of these fields as possible were populated within each dataset. All fire layers were then merged together by their common attributes to create a merged dataset containing all fire polygons. Polygons were then processed in order of Tier (1-8) so that overlapping polygons in the same year and Tier were dissolved together. Overlapping polygons in subsequent Tiers were removed from the dataset. Attributes from the original datasets of all intersecting polygons in the same year across all Tiers were also merged, so that all attributes from all Tiers were included, but only the polygons from the highest-ranking Tier were dissolved to form the fire polygon. The resulting product (the combined dataset) has only one fire per year in a given area, with one set of attributes.

    While it combines wildfire data from 40 wildfire layers and therefore has more complete information on wildfires than the datasets that went into it, this dataset also has its own set of limitations. Please see the Data Quality attributes within the metadata record for additional information on this dataset's limitations. Overall, we believe this dataset is designed to be a comprehensive collection of fire boundaries within the United States and provides a more thorough and complete picture of fires across the United States when compared to the datasets that went into it.
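
    The actual workflow used manual steps and ArcGIS Python (arcpy) scripts that are not included here; purely as an illustration of the merge-and-dissolve pattern described above, a geopandas sketch might look like the following (the layer paths and the FireYear/Tier field names are assumptions, and the cross-Tier overlap removal is omitted):

    import pandas as pd
    import geopandas as gpd

    # Hypothetical input layers, each already tagged with a Tier rank and a fire year
    layers = [gpd.read_file(p) for p in ["tier1_fires.shp", "tier2_fires.shp"]]

    # Merge all layers into one dataset with a common schema
    fires = pd.concat(layers, ignore_index=True)

    # Dissolve overlapping polygons that share the same year and Tier
    dissolved = fires.dissolve(by=["FireYear", "Tier"], as_index=False)
    dissolved.to_file("merged_fires.gpkg", driver="GPKG")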

  4. Connecticut State Parcel Layer 2023

    • catalog.data.gov
    • data.ct.gov
    • +2more
    Updated May 10, 2025
    + more versions
    Cite
    State of Connecticut (2025). Connecticut State Parcel Layer 2023 [Dataset]. https://catalog.data.gov/dataset/connecticut-state-parcel-layer-2023-74a65
    Explore at:
    Dataset updated
    May 10, 2025
    Dataset provided by
    State of Connecticut
    Area covered
    Connecticut
    Description

    The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2023 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor's database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 12/08/2023 from data collected in 2022-2023. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.

    CAMA Notes:
    • The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.
    • CAMA was provided by the towns.
    • Canaan parcels are viewable, but no additional information is available since no CAMA data was submitted.

    Spatial Data Notes:
    • Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,247,506 parcels.
    • No alteration has been made to the spatial geometry of the data.
    • The data fields carrying CAMA information were sourced from the towns' CAMA data.
    • If a town provided no field for linking the parcels back to the CAMA, a field from the original data was selected instead, provided it joined back to the CAMA with a match rate above 50%.
    • Linking fields were renamed to "Link".
    • All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.
    • Only the fields for town name, Location, Editor, Edit Date, and the link fields associated with the towns' CAMA were used in the creation of this dataset; any other field provided in the original data was deleted or not used.
    • Field names for town (Muni, Municipality) were renamed to "Town Name".
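
    A minimal sketch of the linking logic described in the notes above, i.e. prefixing the link value with a census town code and joining CAMA attributes onto parcels (the file names, the example town code and the exact column names are assumptions):

    import pandas as pd

    # Hypothetical per-town inputs: one parcel table and one CAMA table
    parcels = pd.read_csv("parcels_sometown.csv")
    cama = pd.read_csv("cama_sometown.csv")

    town_code = "097"  # hypothetical census town code

    # Prefix the linking value with the town code to make it unique statewide
    parcels["Link"] = town_code + parcels["Link"].astype(str)
    cama["Link"] = town_code + cama["Link"].astype(str)

    # Join CAMA attributes onto the parcel records
    joined = parcels.merge(cama, on="Link", how="left")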

  5. CLM AWRA HRVs Uncertainty Analysis

    • data.gov.au
    • researchdata.edu.au
    • +1more
    Updated Nov 19, 2019
    Cite
    Bioregional Assessment Program (2019). CLM AWRA HRVs Uncertainty Analysis [Dataset]. https://data.gov.au/data/dataset/e51a513d-fde7-44ba-830c-07563a7b2402
    Explore at:
    Dataset updated
    Nov 19, 2019
    Dataset provided by
    Bioregional Assessment Program
    Description

    Abstract

    This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    This dataset contains the data and scripts to generate the hydrological response variables for surface water in the Clarence Moreton subregion as reported in CLM261 (Gilfedder et al. 2016).

    Dataset History

    File CLM_AWRA_HRVs_flowchart.png shows the different files in this dataset and how they interact. The Python and R scripts, written by the BA modelling team, read, combine and analyse the source datasets CLM AWRA model, CLM groundwater model V1 and CLM16swg Surface water gauging station data within the Clarence Moreton Basin (as detailed below) to create the hydrological response variables for surface water as reported in CLM2.6.1 (Gilfedder et al. 2016).

    R-script HRV_SWGW_CLM.R reads, for each model simulation, the outputs from the surface water model in netcdf format from file Qtot.nc (dataset CLM AWRA model) and the outputs from the groundwater model, flux_change.csv (dataset CLM groundwater model V1) and creates a set of files in subfolder /Output for each GaugeNr and simulation Year:

    CLM_GaugeNr_Year_all.csv and CLM_GaugeNR_Year_baseline.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for baseline conditions

    CLM_GaugeNr_Year_CRDP.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for CRDP conditions (=AWRA streamflow - MODFLOW change in SW-GW flux)

    CLM_GaugeNr_Year_minMax.csv: minimum and maximum of HRVs over all 5000 simulations

    Python script CLM_collate_DoE_Predictions.py collates that information into the following files, for each HRV and each maxtype (absolute maximum (amax), relative maximum (pmax) and time of absolute maximum change (tmax)):

    CLM_AWRA_HRV_maxtyp_DoE_Predictions: for each simulation and each gauge_nr, the maxtyp of the HRV over the prediction period (2012 to 2102)

    CLM_AWRA_HRV_DoE_Observations: for each simulation and each gauge_nr, the HRV for the years that observations are available

    CLM_AWRA_HRV_Observations: summary statistics of each HRV and the observed value (based on data set CLM16swg Surface water gauging station data within the Clarence Moreton Basin)

    CLM_AWRA_HRV_maxtyp_Predictions: summary statistics of each HRV

    R-script CLM_CreateObjectiveFunction.R calculates for each HRV the objective function value for all simulations and stores it in CLM_AWRA_HRV_ss.csv. This file is used by python script CLM_AWRA_SI.py to generate figure CLM-2615-002-SI.png (sensitivity indices).

    The AWRA objective function is combined with the overall objective function from the groundwater model in dataset CLM Modflow Uncertainty Analysis (CLM_MF_DoE_ObjFun.csv) into csv file CLM_AWRA_HRV_oo.csv. This file is used to select behavioural simulations in python script CLM-2615-001-top10.py. This script uses files CLM_NodeOrder.csv and BA_Visualisation.py to create the figures CLM-2616-001-HRV_10pct.png.

    Dataset Citation

    Bioregional Assessment Programme (2016) CLM AWRA HRVs Uncertainty Analysis. Bioregional Assessment Derived Dataset. Viewed 28 September 2017, http://data.bioregionalassessments.gov.au/dataset/e51a513d-fde7-44ba-830c-07563a7b2402.

    Dataset Ancestors

  6. Hydrologic Unit (HUC8) Boundaries for Alaskan Watersheds

    • dataone.org
    • knb.ecoinformatics.org
    • +1more
    Updated Mar 20, 2019
    Cite
    Jared Kibele (2019). Hydrologic Unit (HUC8) Boundaries for Alaskan Watersheds [Dataset]. http://doi.org/10.5063/F14T6GM3
    Explore at:
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Jared Kibele
    Time period covered
    Jan 19, 2018
    Area covered
    Variables measured
    name, source, id_numeric, id_original
    Description

    The United States is divided and sub-divided into successively smaller hydrologic units which are classified into four levels: regions, sub-regions, accounting units, and cataloging units. The hydrologic units are arranged or nested within each other, from the largest geographic area (regions) to the smallest geographic area (cataloging units). Each hydrologic unit is identified by a unique hydrologic unit code (HUC) consisting of two to eight digits based on the four levels of classification in the hydrologic unit system. A shapefile (or geodatabase) of watersheds for the state of Alaska and parts of western Canada was created by merging two datasets: the U.S. Watershed Boundary Dataset (WBD) and the Government of Canada's National Hydro Network (NHN). Since many rivers in Alaska are transboundary, the NHN data is necessary to capture their watersheds. The WBD data can be found at https://catalog.data.gov/dataset/usgs-national-watershed-boundary-dataset-wbd-downloadable-data-collection-national-geospatial- and the NHN data can be found here: https://open.canada.ca/data/en/dataset/a4b190fe-e090-4e6d-881e-b87956c07977. The included python script was used to subset and merge the two datasets into the single dataset, archived here.
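
    The Python script itself is not reproduced in this listing; a minimal geopandas sketch of the subset-and-merge step it describes could look like the following (the local file names are assumptions, and harmonizing the two attribute schemas is omitted):

    import pandas as pd
    import geopandas as gpd

    # Hypothetical local copies of the two source datasets
    wbd = gpd.read_file("WBD_HU8_Alaska.shp")
    nhn = gpd.read_file("NHN_workunits.shp")

    # Reproject the Canadian data to match the WBD coordinate system
    nhn = nhn.to_crs(wbd.crs)

    # Tag each record with its source, then combine into a single layer
    wbd["source"] = "WBD"
    nhn["source"] = "NHN"
    merged = pd.concat([wbd, nhn], ignore_index=True)
    merged.to_file("ak_huc8_watersheds.shp")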

  7. Data from: Enhancing Open Modification Searches via a Combined Approach...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 2, 2020
    Cite
    Mechthild Pohlschröder (2020). Enhancing Open Modification Searches via a Combined Approach Facilitated by Ursgal [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4299357
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    Stefan Schulze
    Christian Fufezan
    Sebastian A. Leidel
    Aime Bienfait Igiraneza
    Manuel Kösters
    Mechthild Pohlschröder
    Johannes Leufken
    Benjamin A. Garcia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The identification of peptide sequences and their post-translational modifications (PTMs) is a crucial step in the analysis of bottom-up proteomics data. The recent development of open modification search (OMS) engines allows virtually all PTMs to be searched for. This not only increases the number of spectra that can be matched to peptides but also greatly advances the understanding of biological roles of PTMs through the identification, and thereby facilitated quantification, of peptidoforms (peptide sequences and their potential PTMs). While the benefits of combining results from multiple protein database search engines have been established previously, similar approaches for OMS results are missing so far. Here, we compare and combine results from three different OMS engines, demonstrating an increase in peptide spectrum matches of 8-18%. The unification of search results furthermore allows for the combined downstream processing of search results, including the mapping to potential PTMs. Finally, we test for the ability of OMS engines to identify glycosylated peptides. The implementation of these engines in the Python framework Ursgal facilitates the straightforward application of OMS with unified parameters and results files, thereby enabling yet unmatched high-throughput, large-scale data analysis.

    This dataset includes all relevant results files, databases, and scripts that correspond to the accompanying journal article. Specifically, the following files are deposited:

    Homo_sapiens_PXD004452_results.zip: result files from OMS and CS for the dataset PXD004452

    Homo_sapiens_PXD013715_results.zip: result files from OMS and CS for the dataset PXD013715

    Haloferax_volcanii_PXD021874_results.zip: result files from OMS and CS for the dataset PXD021874

    Escherichia_coli_PXD000498_results.zip: result files from OMS and CS for the dataset PXD000498

    databases.zip: target-decoy databases for Homo sapiens, Escherichia coli and Haloferax volcanii as well as a glycan database for Homo sapiens

    scripts.zip: example scripts for all relevant steps of the analysis

    mzml_files.zip: mzML files for all included datasets

    ursgal.zip: current version of Ursgal (0.6.7) that has been used to generate the results (for most recent versions see https://github.com/ursgal/ursgal)

  8. Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation...

    • b2find.eudat.eu
    Updated Jul 31, 2025
    Cite
    (2025). Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3524622d-2099-554c-826a-f2155c3f4bb4
    Explore at:
    Dataset updated
    Jul 31, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2), conducted for the paper: "What do anomaly scores actually mean? Key characteristics of algorithms' dynamics beyond accuracy" by F. Iglesias, H. O. Marques, A. Zimek, T. Zseby.

    Context and methodology

    Anomaly detection is intrinsic to a large number of data analysis applications today. Most of the algorithms used assign an outlierness score to each instance prior to establishing anomalies in a binary form. The experiments in this repository study how different algorithms generate different dynamics in the outlierness scores and react in very different ways to possible model perturbations that affect data. The study elaborated in the referred paper presents new indices and coefficients to assess the dynamics and explores the responses of the algorithms as a function of variations in these indices, revealing key aspects of the interdependence between algorithms, data geometries and the ability to discriminate anomalies. Therefore, this repository reproduces the conducted experiments, which study eight algorithms (ABOD, HBOS, iForest, K-NN, LOF, OCSVM, SDO and GLOSH) submitted to seven perturbations related to: cardinality, dimensionality, outlier proportion, inlier-outlier density ratio, density layers, clusters and local outliers. It collects behavioural profiles with eleven measurements (Adjusted Average Precision, ROC-AUC, Perini's Confidence [1], Perini's Stability [2], S-curves, Discriminant Power, Robust Coefficients of Variation for Inliers and Outliers, Coherence, Bias and Robustness) under two types of normalization: linear and Gaussian, the latter aiming to standardize the outlierness scores issued by different algorithms [3].

    This repository is framed within research on the following domains: algorithm evaluation, outlier detection, anomaly detection, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    References

    [1] Perini, L., Vercruyssen, V., Davis, J.: Quantifying the confidence of anomaly detectors in their example-wise predictions. In: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer Verlag (2020).

    [2] Perini, L., Galvin, C., Vercruyssen, V.: A Ranking Stability Measure for Quantifying the Robustness of Anomaly Detection Methods. In: 2nd Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning @ ECML/PKDD (2020).

    [3] Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pp. 13-24 (2011).

    Technical details

    Experiments were tested with Python 3.9.6. The provided scripts generate all synthetic data and results; we keep them in the repository for the sake of comparability and replicability ("outputs.zip" file). The file and folder structure is as follows:

    • "compare_scores_group.py" is a Python script to extract the new dynamic indices proposed in the paper.
    • "generate_data.py" is a Python script to generate the datasets used for evaluation.
    • "latex_table.py" is a Python script to show results in a latex-table format.
    • "merge_indices.py" is a Python script to merge accuracy and dynamic indices in the same table-structured summary.
    • "metric_corr.py" is a Python script to calculate correlation estimations between indices.
    • "outdet.py" is a Python script that runs outlier detection with different algorithms on diverse datasets.
    • "perini_tests.py" is a Python script to run Perini's confidence and stability on all datasets and algorithms' performances.
    • "scatterplots.py" is a Python script that generates scatter plots for comparing accuracy and dynamic performances.
    • "README.md" provides explanations and step-by-step instructions for replication.
    • "requirements.txt" contains references to required Python libraries and versions.
    • "outputs.zip" contains all result tables, plots and synthetic data generated with the scripts.
    • [data/real_data] contains CSV versions of the Wilt, Shuttle, Waveform and Cardiotocography datasets (inherited and adapted from the LMU repository).

    License

    The CC-BY license applies to all data generated with the "generated_data.py" script. All distributed code is under the GNU GPL license. For the "ExCeeD.py" and "stability.py" scripts, please consult and refer to the original sources provided above.

  9. Hydrologic Unit (HUC8) Boundaries for Alaskan Watersheds

    • knb.ecoinformatics.org
    Updated Jan 4, 2019
    Cite
    Jared Kibele (2019). Hydrologic Unit (HUC8) Boundaries for Alaskan Watersheds [Dataset]. http://doi.org/10.5063/F1M043MV
    Explore at:
    Dataset updated
    Jan 4, 2019
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Jared Kibele
    Time period covered
    Jan 19, 2018
    Area covered
    Variables measured
    name, source, id_numeric, id_original
    Description

    The United States is divided and sub-divided into successively smaller hydrologic units which are classified into four levels: regions, sub-regions, accounting units, and cataloging units. The hydrologic units are arranged or nested within each other, from the largest geographic area (regions) to the smallest geographic area (cataloging units). Each hydrologic unit is identified by a unique hydrologic unit code (HUC) consisting of two to eight digits based on the four levels of classification in the hydrologic unit system. A shapefile (or geodatabase) of watersheds for the state of Alaska and parts of western Canada was created by merging two datasets: the U.S. Watershed Boundary Dataset (WBD) and the Government of Canada's National Hydro Network (NHN). Since many rivers in Alaska are transboundary, the NHN data is necessary to capture their watersheds. The WBD data can be found at https://catalog.data.gov/dataset/usgs-national-watershed-boundary-dataset-wbd-downloadable-data-collection-national-geospatial- and the NHN data can be found here: https://open.canada.ca/data/en/dataset/a4b190fe-e090-4e6d-881e-b87956c07977. The included python script was used to subset and merge the two datasets into the single dataset, archived here.

  10. Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava...

    • portaldelainvestigacion.uma.es
    Updated 2025
    Cite
    Domínguez García-Escudero, Raúl; Germa, Thierry; Pérez del Pulgar Mancebo, Carlos Jesús (2025). Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave - Datasets [Dataset]. https://portaldelainvestigacion.uma.es/documentos/67a9c7ce19544708f8c7316e
    Explore at:
    Dataset updated
    2025
    Authors
    Domínguez García-Escudero, Raúl; Germa, Thierry; Pérez del Pulgar Mancebo, Carlos Jesús
    Description

    The dataset contains the logs used to produce the results described in the publication "Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave", Raúl Domínguez et al., 2025.

    Cooperative Surface Exploration

    • CoRob_MP1_results.xlsx: Includes the log produced at the commanding station during the Mission Phase 1. It has been used to produce the results evaluation of the MP1.

    • cmap.ply: Resulting map of the MP1.

    • ground_truth_transformed_and_downsampled.ply: Ground truth map used for the evaluation of the cooperative map accuracy.

    Ground Truth Rover Logs

    The dataset contains the samples used to generate the map provided as ground truth for the cave in the publication "Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave", Raúl Domínguez et al., 2025.

    The dataset has three parts. Between each of the parts, the data capture had to be interrupted, and after each interruption the position of the rover was not exactly the same as before. For that reason, it has been quite challenging to generate a full reconstruction using the three parts one after the other. Since it was not possible to combine the different parts in a single SLAM reconstruction, the last of the logs was not filtered or even pre-processed.

    Each log contains:
    • depthmaps: the raw LiDAR data from the Velodyne 32. Format: tiff.
    • filtered_cloud: the pre-processed LiDAR data from the Velodyne 32. Format: ply.
    • joint_states: the motor position values. Unfortunately the back axis passive joint is not included. Format: json.
    • orientation_samples: the orientation as provided by the IMU sensor. Format: json.

    • asguard_v4.urdf: In addition to the datasets, a geometrical robot model is provided which might be needed for environment reconstruction and pose estimation algorithms. Format: URDF.

    Folders contents

    ├── 20211117-1112
    │   ├── depth
    │   │   └── depth_1637143958347198
    │   ├── filtered_cloud
    │   │   └── cloud_1637143958347198
    │   ├── joint_states
    │   │   └── joints_state_1637143957824829
    │   └── orientation_samples
    │       └── orientation_sample_1637143958005814
    ├── 20211117-1140
    │   ├── depth
    │   │   └── depth_1637145649108790
    │   ├── filtered_cloud
    │   │   └── cloud_1637145649108790
    │   ├── joint_states
    │   │   └── joints_state_1637145648630977
    │   └── orientation_samples
    │       └── orientation_sample_1637145648831795
    └── 20211117-1205
        ├── depth
        │   └── depth_1637147164030135
        ├── filtered_cloud
        │   └── cloud_1637147164330388
        ├── joint_states
        │   └── joints_state_1637147163501574
        └── orientation_samples
            └── orientation_sample_1637147163655187

    Cave reconstruction

    • first_log_2cm_res_pointcloud-20231222.ply, contains the integrated pointcloud produced from the first of the logs.

    Coyote 3 Logs

    The msgpack datasets can be imported using Python with the pocolog2msgpack library; a rough loading sketch is shown below.
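
    Purely as a generic illustration (the exact export schema produced by pocolog2msgpack is not described here, and the file name is an assumption), a msgpack export can be inspected like this:

    import msgpack

    # Hypothetical exported file; the real exports ship as *_msgpacks.tar.gz archives
    with open("coyote3_odometry.msg", "rb") as f:
        data = msgpack.unpack(f, raw=False)

    # Print the top-level keys/streams contained in the export
    if isinstance(data, dict):
        for key in data:
            print(key)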

    The geometrical rover model of Coyote 3 is included in URDF format. It can be used in environment reconstruction algorithms which require the positions of the different sensors.

    MP3

    Includes exports of the log files used to compute the KPIs of the MP3.

    MP4

    These logs were used to obtain the KPI values for the MP4. It is composed of the following archives:
    • log_coyote_02-03-2023_13-22_01-exp3.zip
    • log_coyote_02-03-2023_13-22_01-exp4.zip
    • log_coyote_02-09-2023_19-14_18_demo_skylight.zip
    • log_coyote_02-09-2023_19-14_20_demo_teleop.zip
    • coyote3_odometry_20230209-154158.0003_msgpacks.tar.gz
    • coyote3_odometry_20230203-125251.0819_msgpacks.tar.gz

    Cave PLYs

    Two integrated pointclouds and one trajectory produced from logs captured by Coyote 3 inside the cave:
    • Skylight_subsampled_mesh.ply
    • teleop_tunnel_pointcloud.ply
    • traj.ply

    Example scripts to load the datasets

    The repository https://github.com/Rauldg/corobx_dataset_scripts contains some example scripts which load some of the datasets.

  11. Features of probabilistic linkage solutions available for record linkage...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Features of probabilistic linkage solutions available for record linkage applications. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Features of probabilistic linkage solutions available for record linkage applications.

  12. Datasets and software for article "Predicting coarse-grained representations...

    • zenodo.org
    application/gzip, zip
    Updated Jan 30, 2025
    Cite
    Arnaud Belcour; Hidde de Jong; Delphine Ropers (2025). Datasets and software for article "Predicting coarse-grained representations of biogeochemical cycles from metabarcoding data" [Dataset]. http://doi.org/10.5281/zenodo.14762347
    Explore at:
    application/gzip, zip (available download formats)
    Dataset updated
    Jan 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Arnaud Belcour; Hidde de Jong; Delphine Ropers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets for Tabigecy article

    This repository contains additional information for the article "Predicting coarse-grained representations of biogeochemical cycles from metabarcoding data". This information allows the results presented in the article to be reproduced:

    - article_data: this zip archive contains:

    • input files used for the article experiments:
      • bordenave_et_al_2013.tsv, bordenave_et_al_2013_abundance.csv and bordenave_et_al_2013_group.tsv for the Bordenave et al. dataset.
      • schwab_et_al_2022.tsv, schwab_et_al_2022_abundance.tsv and schwab_et_al_2022_sample_grouping.tsv for the Schwab et al. dataset.
    • output folders from Tabigecy run on these inputs:
      • output_bordenave: output folder for Bordenave et al. dataset.
      • output_schwab: output folder for Schwab et al. dataset.
    • scripts used to create plots:
      • create_pca.R to create PCA biplot and correlation plot from the results of both datasets.
      • bordenave_create_figure_article.py to create polar plots for the Bordenave et al. dataset.
      • schwab_create_figure_article.py to create polar plots for the Schwab et al. dataset.
    • several folders containing svg files for the article figures: bordenave_figure, experiment_figure, schwab_figure and workflow_figure.
    • original_data: original input files for both datasets.

    - input_files_esmecata_precomputed_db.zip: six input files created by SPARQL queries on UniProt to extract all taxa associated with species, genus, family, order, class and phylum. They were created using a script available in EsMeCaTa repository: esmecata/precomputed/create_input_precomputation.py. These files were used as input to esmecata proteomes to create the precomputed database.

    - database_proteomes_folder.zip: compressed archive containing the proteomes retrieved by EsMeCaTa for species, genus, family, order, class and phylum to create the EsMeCaTa precomputed database version 1.0.0. It is the result of combining the different runs of esmecata proteomes on the 6 taxonomic ranks for all the associated taxa of UniProt. From this folder, the precomputed database has been created by means of the following commands:

    Clustering of the proteomes:

    esmecata clustering -i database_proteomes_folder -o database_output_clustering -c 32 --remove-tmp

    Annotation of the consensus proteomes:

    esmecata annotation -i database_output_clustering -o database_output_annotation -e /path/to/eggnog/database -c 32

    Merging results from the three folders into the precomputed database:

    esmecata_create_db from_runs -iproteomes database_proteomes_folder -iclustering database_output_clustering -iannotation database_output_annotation -o esmecata_precomputed_database --db-version "1.0" -c 10

    - software.zip: compressed archive containing the code of the tools developed and used in the article:

    • bigecyhmm-0.1.5.zip: contains the code of bigecyhmm version 0.1.5 used in the article.
    • esmecata-0.6.0.zip: contains the code of esmecata version 0.6.0 used in the article.
    • tabigecy-0.1.1.zip: contains the code of tabigecy version 0.1.1 used in the article.

    - taxdmp_2024-10-01.tar.gz: The version of the NCBI Taxonomy database used in the article. To use this version of the database with EsMeCaTa, you have to import it with ete3 using the following command:

    python3 -c "from ete3 import NCBITaxa; ncbi = NCBITaxa(); ncbi.update_taxonomy_database('taxdmp_2024-10-01.tar.gz')"

    Perform experiments

    Analyses performed in the article can be reproduced by running tabigecy on the input files of the article_data archive.

    To do so, add the EsMeCaTa input file (either bordenave_et_al_2013.tsv or schwab_et_al_2022.tsv) to the parameter --infile and the abundance file (either bordenave_et_al_2013_abundance.csv or schwab_et_al_2022_abundance.tsv) to the parameter --inAbundfile. The precomputed database is required and can be given with the parameter --precomputedDB. The database can be downloaded from Zenodo.

    Commands for the Bordenave et al. dataset:

    nextflow run ArnaudBelcour/tabigecy --infile bordenave_et_al_2013.tsv --inAbundfile bordenave_et_al_2013_abundance.csv --precomputedDB /path/to/esmecata_database.zip --outputFolder output_bordenave --coreBigecyhmm xx

    Commands for the Schwab et al. dataset:

    nextflow run ArnaudBelcour/tabigecy --infile schwab_et_al_2022.tsv --inAbundfile schwab_et_al_2022_abundance.tsv --precomputedDB /path/to/esmecata_database.zip --outputFolder output_schwab --coreBigecyhmm xx

    To decrease the runtime of the workflow, it is advised to give several cores to `--coreBigecyhmm xx`. With 5 cores, the runtime of the workflow is around 13 minutes.

    To create polar plots, call the two Python scripts at the same location where the input files and output folder are (inside article_data folder):

    python3 bordenave_create_figure_article.py

    python3 schwab_create_figure_article.py

    To create the PCA and correlation plots, launch the R script on the same location:

    Rscript create_pca.R

    Metadata

    The experiments were performed with the following tool versions:

    • Java (OpenJDK): 11.0.22
    • Nextflow: 24.10.3
    • Tabigecy: 0.1.1
    • Python: 3.12.2
    • EsMeCaTa: 0.6.0
    • EsMeCaTa precomputed database: 1.0.0
    • ete3: 3.1.3
    • biopython: 1.83
    • bigecyhmm: 0.1.5
    • pandas: 1.5.3
    • plotly: 5.19.0
    • matplotlib: 3.9.2
    • seaborn: 0.13.2
    • kaleido: 0.2.1
    • pyhmmer: 0.10.8
    • pillow: 10.1.0
    • R: 4.4.1
    • factoextra: 1.0.7
    • ade4: 1.7-22
    • corrplot: 0.94

  13. Data from: Mining Rule Violations in JavaScript Code Snippets

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jan 24, 2020
    Cite
    Bonifácio, Rodrigo (2020). Mining Rule Violations in JavaScript Code Snippets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2593817
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Moraes, João Pedro
    Ferreira Campos, Uriel
    Smethurst, Guilherme
    Pinto, Gustavo
    Bonifácio, Rodrigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Content of this repository

    This is the repository that contains the scripts and dataset for the MSR 2019 mining challenge.

    GitHub repository with the software used: here.

    DATASET

    The dataset was retrieved using Google BigQuery and dumped to a CSV file for further processing. This original, untreated file is called jsanswers.csv, and it contains the following information:
    1. The Id of the question (PostId)
    2. The Content (in this case the code block)
    3. The length of the code block
    4. The line count of the code block
    5. The score of the post
    6. The title

    A quick look at these files shows that a PostID can have multiple rows related to it; that is how multiple code blocks are saved in the database.

    Filtered Dataset:

    Extracting code from CSV

    We used a Python script called "ExtractCodeFromCSV.py" to extract the code from the original CSV and merge all the code blocks into their respective JavaScript files, named after the PostID; this resulted in 336 thousand files. A minimal sketch of this step is shown below.
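
    A minimal sketch of that extraction step, assuming the PostId and Content column names listed above (the real column names in jsanswers.csv may differ):

    import pandas as pd
    from pathlib import Path

    df = pd.read_csv("jsanswers.csv")
    out = Path("js_files")
    out.mkdir(exist_ok=True)

    # Concatenate all code blocks belonging to the same question into one .js file
    for post_id, group in df.groupby("PostId"):
        code = "\n\n".join(group["Content"].astype(str))
        (out / f"{post_id}.js").write_text(code, encoding="utf-8")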

    Running ESlint

    Due to the single-threaded nature of ESLint, running it on 336 thousand files took a huge toll on the machine, so we created a script to run it in parallel. This script is named "ESlintRunnerScript.py"; it splits the files into 20 evenly distributed parts and runs 20 ESLint processes to generate the reports, producing 20 JSON files. A rough sketch of this approach is shown below.
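
    A rough sketch of that chunk-and-run approach (the repository's script is not reproduced here; the eslint invocation and the report file names are assumptions):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    files = sorted(Path("js_files").glob("*.js"))
    n_parts = 20
    chunks = [files[i::n_parts] for i in range(n_parts)]

    def run_eslint(idx, chunk):
        # One JSON report per chunk, mirroring the 20 reports described above
        cmd = ["npx", "eslint", "--format", "json", "--output-file", f"report_{idx}.json"]
        subprocess.run(cmd + [str(p) for p in chunk], check=False)

    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        for idx, chunk in enumerate(chunks):
            pool.submit(run_eslint, idx, chunk)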

    Number of Violations per Rule

    This information was extracted using the script named "parser.py". It generated the file "NumberofViolationsPerRule.csv", which contains the number of violations per rule used in the linter configuration of the dataset.

    Number of Violations per Category

    As a way to produce relevant statistics on the dataset, we generated the number of violations per rule category as defined on the ESLint website; this information was extracted using the same "parser.py" script.

    Individual Reports

    This information was extracted from the JSON reports; it is a CSV file with PostID and violations per rule.

    Rules

    The file "Rules with categories" contains all the rules used and their categories.

  14. Replication Data for Exploring an extinct society through the lens of...

    • dataone.org
    Updated Dec 16, 2023
    Cite
    Wieczorek, Oliver; Malzahn, Melanie (2023). Replication Data for Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus [Dataset]. http://doi.org/10.7910/DVN/UF8DHK
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wieczorek, Oliver; Malzahn, Melanie
    Description

    The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". This study aimed at utilizing the CEToM corpus (https://cetom.univie.ac.at/) (Tocharian) to analyze the life-world of the elites of an extinct society situated in modern eastern China. To acquire the raw data needed for steps 1 & 2, please contact Melanie Malzahn (melanie.malzahn@univie.ac.at). We conducted a mixed-methods study consisting of close reading, content analysis, and multiple correspondence analysis (MCA). The Excel file titled "fragments_architecture_combined.xlsx" allows for replication of the MCA and equates to the third step of the workflow outlined below.

    We used the following programming languages and packages to prepare the dataset and to analyze the data. Data preparation and merging procedures were achieved in Python (version 3.9.10) with the packages pandas (version 1.5.3), os (version 3.12.0), re (version 3.12.0), numpy (version 1.24.3), gensim (version 4.3.1), BeautifulSoup4 (version 4.12.2), pyasn1 (version 0.4.8), and langdetect (version 1.0.9). Multiple correspondence analyses were conducted in R (version 4.3.2) with the packages FactoMineR (version 2.9), factoextra (version 1.0.7), readxl (version 1.4.3), tidyverse (version 2.0.0), ggplot2 (version 3.4.4) and psych (version 2.3.9).

    After requesting the necessary files, please open the scripts in the order outlined below and execute the code files to replicate the analysis:

    Preparatory step: Create a folder for the Python and R scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script will generate the following folders:
    • "tarim-brahmi_database": folder which contains Tocharian dictionaries and Tocharian text fragments.
    • "dictionaries": contains Tocharian A and Tocharian B vocabularies, including linguistic features such as translations, meanings, part-of-speech tags etc. A full overview of the words is provided at https://cetom.univie.ac.at/?words.
    • "fragments": contains Tocharian text fragments as xml-files.
    • "word_corpus_data": folder that will contain Excel files of the corpus data after the first step.
    • "Architectural_terms": contains the data on the architectural terms used in the dataset (e.g. dwelling, house).
    • "regional_data": contains the data on the findspots (Tocharian and modern Chinese equivalent, e.g. Duldur-Akhur & Kucha).
    • "mca_ready_data": the folder in which the Excel file with the merged data will be saved. Note that the prepared file named "fragments_architecture_combined.xlsx" can be saved into this directory. This allows you to skip steps 1 & 2 and reproduce the MCA of the content analysis based on the third step of our workflow (R script 3_conduct_MCA.R).

    First step - run 1_read_xml-files.py: loops over the xml-files in the dictionaries folder and identifies word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. Then it loops over the xml text files and extracts a text id number, language (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, material on which the text is written, medium, findspot, the source text in Tocharian, and the translation where available. After successful feature extraction, the resulting pandas dataframe object is exported to the word_corpus_data folder.

    Second step - run 2_merge_excel_files.py: merges all Excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was based upon close reading in the first place.

    Third step - run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA. It then produces the descriptive values before conducting the MCA, identifying typical texts per dimension, and exporting the png-files uploaded to this repository.

  15. llama-python-codes-30k

    • huggingface.co
    Updated Oct 24, 2023
    Cite
    FLOCK4H (2023). llama-python-codes-30k [Dataset]. https://huggingface.co/datasets/flytech/llama-python-codes-30k
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 24, 2023
    Authors
    FLOCK4H
    License

    https://choosealicense.com/licenses/llama2/

    Description

    Python Codes - 30k examples, Llama1&2 tokenized dataset

      Author
    

    FlyTech. For a general guide on how to create, quantize, merge, or run inference with the model and more, visit: hackmd.io/my_first_ai

      Overview
    

    This dataset serves as a rich resource for various Natural Language Processing tasks such as Question Answering, Text Generation, and Text-to-Text Generation.

    It primarily focuses on instructional tasks in Python, tokenized specifically for the Llama architecture.… See the full description on the dataset page: https://huggingface.co/datasets/flytech/llama-python-codes-30k.

  16. Data from: Long-Term Tracing of Indoor Solar Harvesting

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, pdf, tar
    Updated Jul 22, 2024
    + more versions
    Cite
    Lukas Sigrist; Andres Gomez; Lothar Thiele (2024). Long-Term Tracing of Indoor Solar Harvesting [Dataset]. http://doi.org/10.5281/zenodo.3346976
    Explore at:
    pdf, tar, bin (available download formats)
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lukas Sigrist; Andres Gomez; Lothar Thiele
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Information

    This dataset presents long-term indoor solar harvesting traces jointly monitored with the ambient conditions. The data is recorded at 6 indoor positions with diverse characteristics at our institute at ETH Zurich in Zurich, Switzerland.

    The data is collected with a measurement platform [3] consisting of a solar panel (AM-5412) connected to a bq25505 energy harvesting chip that stores the harvested energy in a virtual battery circuit. Two TSL45315 light sensors placed on opposite sides of the solar panel monitor the illuminance level and a BME280 sensor logs ambient conditions like temperature, humidity and air pressure.

    The dataset contains the measurement of the energy flow at the input and the output of the bq25505 harvesting circuit, as well as the illuminance, temperature, humidity and air pressure measurements of the ambient sensors. The following timestamped data columns are available in the raw measurement format, as well as in preprocessed and filtered HDF5 datasets (a short loading sketch follows the column list):

    • V_in - Converter input/solar panel output voltage, in volt
    • I_in - Converter input/solar panel output current, in ampere
    • V_bat - Battery voltage (emulated through circuit), in volt
    • I_bat - Net Battery current, in/out flowing current, in ampere
    • Ev_left - Illuminance left of solar panel, in lux
    • Ev_right - Illuminance right of solar panel, in lux
    • P_amb - Ambient air pressure, in pascal
    • RH_amb - Ambient relative humidity, unit-less between 0 and 1
    • T_amb - Ambient temperature, in centigrade Celsius
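
    A minimal usage sketch, assuming the processed HDF5 files can be read with pandas and use the column names above (the file name and HDF5 key are assumptions):

    import pandas as pd

    # Hypothetical processed file for one measurement position
    df = pd.read_hdf("pos01_power.h5", key="data")

    # Harvested input power from the solar panel, in watt
    df["P_in"] = df["V_in"] * df["I_in"]
    print(df[["V_in", "I_in", "P_in"]].describe())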

    The following publication presents an overview of the dataset and more details on the deployment used for data collection. A copy of the abstract is included in this dataset; see the file abstract.pdf.

    L. Sigrist, A. Gomez, and L. Thiele. Dataset: Tracing Indoor Solar Harvesting. In Proceedings of the 2nd Workshop on Data Acquisition To Analysis (DATA '19), 2019. [under submission]

    Folder Structure and Files

    • processed/ - This folder holds the imported, merged and filtered datasets of the power and sensor measurements. The datasets are stored in HDF5 format and split by measurement position posXX and by power and ambient sensor measurements. The files belonging to this folder are contained in archives named yyyy_mm_processed.tar, where yyyy and mm represent the year and month the data was published. A separate file lists the exact content of each archive (see below).
    • raw/ - This folder holds the raw measurement files recorded with the RocketLogger [1, 2] and using the measurement platform available at [3]. The files belonging to this folder are contained in archives named yyyy_mm_raw.tar, where yyyy and mm represent the year and month the data was published. A separate file lists the exact content of each archive (see below).
    • LICENSE - License information for the dataset.
    • README.md - The README file containing this information.
    • abstract.pdf - A copy of the above mentioned abstract submitted to the DATA '19 Workshop, introducing this dataset and the deployment used to collect it.
    • raw_import.ipynb [open in nbviewer] - Jupyter Python notebook to import, merge, and filter the raw dataset from the raw/ folder. This is the exact code used to generate the processed dataset and store it in the HDF5 format in the processed/ folder.
    • raw_preview.ipynb [open in nbviewer] - This Jupyter Python notebook imports the raw dataset directly and plots a preview of the full power trace for all measurement positions.
    • processing_python.ipynb [open in nbviewer] - Jupyter Python notebook demonstrating the import and use of the processed dataset in Python. Calculates column-wise statistics, includes more detailed power plots and the simple energy predictor performance comparison included in the abstract.
    • processing_r.ipynb [open in nbviewer] - Jupyter R notebook demonstrating the import and use of the processed dataset in R. Calculates column-wise statistics and extracts and plots the energy harvesting conversion efficiency included in the abstract. Furthermore, the harvested power is analyzed as a function of the ambient light level.

    Dataset File Lists

    Processed Dataset Files

    The list of the processed datasets included in the yyyy_mm_processed.tar archive is provided in yyyy_mm_processed.files.md. The markdown formatted table lists the name of all files, their size in bytes, as well as the SHA-256 sums.

    Raw Dataset Files

    A list of the raw measurement files included in the yyyy_mm_raw.tar archive(s) is provided in yyyy_mm_raw.files.md. The markdown formatted table lists the name of all files, their size in bytes, as well as the SHA-256 sums.

    Dataset Revisions

    v1.0 (2019-08-03)

    Initial release.
    Includes the data collected from 2017-07-27 to 2019-08-01. The dataset archive files related to this revision are 2019_08_raw.tar and 2019_08_processed.tar.
    For position pos06, the measurements from 2018-01-06 00:00:00 to 2018-01-10 00:00:00 are filtered (data inconsistency in file indoor1_p27.rld).

    Dataset Authors, Copyright and License

    References

    [1] L. Sigrist, A. Gomez, R. Lim, S. Lippuner, M. Leubin, and L. Thiele. Measurement and validation of energy harvesting IoT devices. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

    [2] ETH Zurich, Computer Engineering Group. RocketLogger Project Website, https://rocketlogger.ethz.ch/.

    [3] L. Sigrist. Solar Harvesting and Ambient Tracing Platform, 2019. https://gitlab.ethz.ch/tec/public/employees/sigristl/harvesting_tracing

  17. Learn Data Science Series Part 1

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rupesh Kumar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

    Overview:

    • Chapter 1: Getting started with pandas
    • Chapter 2: Analysis: Bringing it all together and making decisions
    • Chapter 3: Appending to DataFrame
    • Chapter 4: Boolean indexing of dataframes
    • Chapter 5: Categorical data
    • Chapter 6: Computational Tools
    • Chapter 7: Creating DataFrames
    • Chapter 8: Cross sections of different axes with MultiIndex
    • Chapter 9: Data Types
    • Chapter 10: Dealing with categorical variables
    • Chapter 11: Duplicated data
    • Chapter 12: Getting information about DataFrames
    • Chapter 13: Gotchas of pandas
    • Chapter 14: Graphs and Visualizations
    • Chapter 15: Grouping Data
    • Chapter 16: Grouping Time Series Data
    • Chapter 17: Holiday Calendars
    • Chapter 18: Indexing and selecting data
    • Chapter 19: IO for Google BigQuery
    • Chapter 20: JSON
    • Chapter 21: Making Pandas Play Nice With Native Python Datatypes
    • Chapter 22: Map Values
    • Chapter 23: Merge, join, and concatenate
    • Chapter 24: Meta: Documentation Guidelines
    • Chapter 25: Missing Data
    • Chapter 26: MultiIndex
    • Chapter 27: Pandas Datareader
    • Chapter 28: Pandas IO tools (reading and saving data sets)
    • Chapter 29: pd.DataFrame.apply
    • Chapter 30: Read MySQL to DataFrame
    • Chapter 31: Read SQL Server to Dataframe
    • Chapter 32: Reading files into pandas DataFrame
    • Chapter 33: Resampling
    • Chapter 34: Reshaping and pivoting
    • Chapter 35: Save pandas dataframe to a csv file
    • Chapter 36: Series
    • Chapter 37: Shifting and Lagging Data
    • Chapter 38: Simple manipulation of DataFrames
    • Chapter 39: String manipulation
    • Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame
    • Chapter 41: Working with Time Series
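
    As a brief, hedged taste of the material covered (for example Chapter 23, Merge, join, and concatenate, and Chapter 15, Grouping Data), the following minimal pandas sketch uses invented example data:

    ```python
    import pandas as pd

    # Two small example DataFrames (invented for illustration only).
    students = pd.DataFrame({
        "student_id": [1, 2, 3],
        "name": ["Ada", "Grace", "Linus"],
    })
    scores = pd.DataFrame({
        "student_id": [1, 2, 4],
        "score": [91, 85, 78],
    })

    # Chapter 23 territory: merge on a shared key (inner join keeps matching rows only).
    merged = students.merge(scores, on="student_id", how="inner")

    # Chapter 15 territory: group and aggregate.
    mean_score = merged.groupby("name")["score"].mean()

    print(merged)
    print(mean_score)
    ```
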
  18. Connecticut CAMA and Parcel Layer

    • geodata.ct.gov
    • data.ct.gov
    • +1more
    Updated Nov 20, 2024
    + more versions
    Cite
    State of Connecticut (2024). Connecticut CAMA and Parcel Layer [Dataset]. https://geodata.ct.gov/datasets/ctmaps::connecticut-cama-and-parcel-layer
    Explore at:
    Dataset updated
    Nov 20, 2024
    Dataset authored and provided by
    State of Connecticut
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Description

    Coordinate System Update: Notably, this dataset is provided in the NAD 83 Connecticut State Plane (2011) projection (EPSG 2234) instead of WGS 1984 Web Mercator Auxiliary Sphere (EPSG 3857), which was the coordinate system of the 2023 dataset, and it will remain in Connecticut State Plane moving forward.

    Ownership Suppression and Data Access: The updated dataset now includes parcel data for all towns across the state, with some towns featuring fully suppressed ownership information. In these instances, the owner's name is replaced with the label "Current Owner," the co-owner's name is listed as "Current Co-Owner," and the mailing address appears as the property address itself. For towns with suppressed ownership data, users should be aware that there was no "Suppression" field in the submission to verify specific details. This measure was implemented this year to help verify compliance with suppression requirements.

    New Data Fields: The new dataset introduces the "Land Acres" field, which displays the total acreage for each parcel. This additional field allows for more detailed analysis and better supports planning, zoning, and property valuation tasks. Another important addition is the FIPS code field, which provides the Federal Information Processing Standards (FIPS) code for each parcel's corresponding block, allowing users to easily identify which block a parcel is in.

    Updated Service URL: The new parcel service URL includes all the updates mentioned above, such as the improved coordinate system, new data fields, and additional geospatial information. Users are strongly encouraged to transition to the new service as soon as possible to ensure that their workflows remain uninterrupted. Once you have transitioned to the new service, the URL will remain constant, ensuring long-term stability. For a limited time, the old service will continue to be available, but it will eventually be retired; users should plan to switch to the new service well before this cutoff to avoid any disruptions in data access.

    The dataset combines the Parcel and Computer-Assisted Mass Appraisal (CAMA) data for 2024 into a single dataset, designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor's database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 10/31/2024 from data collected in 2023-2024. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.

    CAMA Notes:
    • The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry.
    • The resulting dataset contains 1,353,595 entries with information on property assessments and other relevant attributes.
    • CAMA data was provided by the towns.

    Spatial Data Notes:
    • Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,290,196 parcels.
    • No alteration has been made to the spatial geometry of the data.
    • Fields associated with CAMA data were provided by the towns; data fields containing CAMA information were sourced from the towns' CAMA data.
    • If a town provided no field for linking parcels back to the CAMA, a field from the original data was selected if it joined back to the CAMA with a match rate above 50%.
    • Linking fields were renamed to "Link", and a census town code was added to the beginning of each linking value to create a unique identifier per town (see the sketch below).
    • Only the fields related to town name, location, editor, edit date, and the link fields associated with the towns' CAMA were used in the creation of this dataset; any other field provided in the original data was deleted or not used.
    • Field names for town (Muni, Municipality) were renamed to "Town Name".

    The attributes included in the data: Town Name, Owner, Co-Owner, Link, Editor, Edit Date, Collection Year (year the parcels were submitted), Location, Mailing Address, Mailing City, Mailing State, Assessed Total, Assessed Land, Assessed Building, Pre-Year Assessed Total, Appraised Land, Appraised Building, Appraised Outbuilding, Condition, Model, Valuation, Zone, State Use, State Use Description, Land Acres, Living Area, Effective Area, Total Rooms, Number of Bedrooms, Number of Baths, Number of Half-Baths, Sale Price, Sale Date, Qualified, Occupancy, Prior Sale Price, Prior Sale Date, Prior Book and Page, Planning Region, FIPS Code.

    Please note that not all parcels have a link to a CAMA entry. If any discrepancies are discovered within the data, whether geographical or attribute inaccuracies, please contact the respective municipalities directly to request any necessary amendments. Additional information about the specifics of data availability and compliance will be coming soon. If you need a WFS service for use in specific applications, please use the link provided on the source page.
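
    As a hedged illustration of the linking step described in the Spatial Data Notes (the file names, raw column names, and example town code are placeholders; only the "Link" field name and the prepended census town code come from the description above):

    ```python
    import pandas as pd

    TOWN_CODE = "08070"  # hypothetical census town code for one municipality

    # Hypothetical inputs: one town's parcel attribute table and its CAMA export.
    parcels = pd.read_csv("town_parcels.csv")  # assumed to contain a parcel identifier column
    cama = pd.read_csv("town_cama.csv")        # assumed to contain the matching identifier

    # Standardize the town-specific linking columns to the "Link" field and
    # prepend the census town code to make the identifier unique statewide.
    parcels["Link"] = TOWN_CODE + parcels["PARCEL_ID"].astype(str)  # column name assumed
    cama["Link"] = TOWN_CODE + cama["PID"].astype(str)              # column name assumed

    # Join CAMA attributes onto the parcel records; not every parcel will match.
    combined = parcels.merge(cama, on="Link", how="left")
    match_rate = combined["PID"].notna().mean()
    print(f"CAMA match rate for this town: {match_rate:.1%}")
    ```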

  19. This American Life Podcast Transcript Dataset

    • kaggle.com
    Updated Dec 18, 2023
    Cite
    The Devastator (2023). This American Life Podcast Transcript Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/this-american-life-podcast-transcript-dataset/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    This American Life Podcast Transcript Dataset

    This American Life Podcast Transcripts with Speaker Information and Timestamps

    By Chris Jewell [source]

    About this dataset

    This dataset provides a comprehensive collection of the transcripts for every episode of the popular podcast This American Life since its inception in November 1995. The dataset includes detailed speaker information, timestamps, and act or segment names for each line spoken throughout the episodes.

    With a focus on web scraping using Python and utilizing the powerful BeautifulSoup library, this dataset was meticulously created to offer researchers and enthusiasts an invaluable resource for various analytical purposes. Whether it be sentiment analysis, linguistic studies, or other forms of textual analysis, these transcripts provide a rich mine of data waiting to be explored.
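
    As a hedged sketch of the scraping approach described here (the transcript URL pattern and the choice of HTML elements are assumptions, not details confirmed by the dataset author), the collection step might look roughly like this:

    ```python
    import requests
    from bs4 import BeautifulSoup

    # Assumption: episode transcripts live at a URL of roughly this shape.
    EPISODE_URL = "https://www.thisamericanlife.org/1/transcript"  # hypothetical example

    response = requests.get(
        EPISODE_URL,
        headers={"User-Agent": "transcript-research-bot"},
        timeout=30,
    )
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Assumption: each spoken line is a <p> element; speaker names, act titles,
    # and timestamps would be attached via surrounding tags or attributes that
    # need to be identified by inspecting the actual page source.
    lines = [p.get_text(strip=True) for p in soup.find_all("p") if p.get_text(strip=True)]

    for line in lines[:10]:
        print(line)
    ```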

    The informative columns in this dataset include episode number, radio date (when each episode was aired), title (of each episode), act name (or segment title within an episode), line text (the spoken text by speakers), and speaker class (categorizing speakers into different roles such as host, guest, narrator). The timestamp column further enhances the precision by indicating when each line was spoken during an episode.

    In summary, this comprehensive collection showcases years' worth of captivating storytelling and insightful discussions from This American Life.

    How to use the dataset

    • Exploring Episode Information:

      • The episode_number column represents the number assigned to each episode of the podcast. You can use this column to identify and filter specific episodes based on their number.
      • The title column contains the title of each episode. You can utilize it to search for episodes related to specific topics or themes.
      • The radio_date column indicates when an episode was aired on the radio. It helps in understanding chronological order and exploring episodes released during specific time periods.
    • Analyzing Speaker Information:

      • The speaker_class column classifies speakers into different categories such as host, guest, or narrator. You can analyze speakers based on their roles or categories throughout various episodes.
      • By examining individual speakers' lines using the line_text column, you can explore patterns in speech or track conversations involving specific individuals.
    • Understanding Act/Segment Details:

      • Some episodes may have multiple acts or segments that cover different stories within a single episode. The act_name column provides insight into these act titles or segment names.
    • Utilizing Timestamps:

      • Each line spoken by a speaker is associated with a timestamp in the timestamp field. This enables mapping spoken lines to specific points within an episode.

    • Textual Analysis:

      • Perform sentiment analysis on the text-based sentiments expressed by different speakers across various episodes.
      • Conduct topic modeling techniques like Latent Dirichlet Allocation (LDA) to identify recurring themes or topics discussed in This American Life episodes.
      • Utilize natural language processing techniques to understand linguistic patterns, word frequencies, and sentiment changes over time or across different speakers.

    Please note:
    • Ensure you have basic knowledge of data manipulation, analysis, and visualization techniques.
    • Consider preprocessing the text data by cleaning punctuation, removing stopwords, and normalizing words for optimal analysis results.
    • Feel free to combine this dataset with external sources, such as additional transcripts, for more comprehensive analysis.
    A short loading example is sketched below.
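
    As a minimal, hedged loading sketch (the CSV file name is hypothetical; the column names follow the description above):

    ```python
    import pandas as pd

    # Hypothetical file name; substitute the actual CSV shipped with the dataset.
    df = pd.read_csv("this_american_life_transcripts.csv")

    # Basic orientation: episodes, speaker roles, and line counts.
    print(df[["episode_number", "title", "radio_date"]].drop_duplicates().head())
    print(df["speaker_class"].value_counts())

    # Lines spoken by hosts in a single episode, in order of appearance.
    episode_one_hosts = df[(df["episode_number"] == 1) & (df["speaker_class"] == "host")]
    print(episode_one_hosts[["timestamp", "line_text"]].head())
    ```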

    Research Ideas

    • Sentiment Analysis: With the transcript data and speaker information, this dataset can be used to perform sentiment analysis on each line spoken by different speakers in the podcast episodes. This can provide insights into the overall tone and sentiment of the podcast episodes.
    • Speaker Analysis: By analyzing the speaker information and their respective lines, this dataset can be used to analyze patterns in terms of who speaks more or less frequently, which speakers are more prominent or influential in certain episodes or acts, and how different speakers contribute to the narrative structure of each episode.
    • Topic Modeling: By using natural language processing techniques, this dataset can be used for topic modeling analysis to identify recurring themes or topics discussed in This American Life episodes. This can help uncover patterns or track how certain topics have evolved over time throughout the podcast's history.

    Acknowledgements

    If yo...

  20. (HS 2) Automate Workflows using Jupyter notebook to create Large Extent Spatial Datasets

    • search.dataone.org
    • hydroshare.org
    Updated Oct 19, 2024
    + more versions
    Cite
    Young-Don Choi (2024). (HS 2) Automate Workflows using Jupyter notebook to create Large Extent Spatial Datasets [Dataset]. http://doi.org/10.4211/hs.a52df87347ef47c388d9633925cde9ad
    Explore at:
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    Hydroshare
    Authors
    Young-Don Choi
    Description

    We implemented automated workflows using Jupyter notebooks for each state. The GIS processing, crucial for merging, extracting, and projecting GeoTIFF data, was performed using ArcPy, a Python package for geographic data analysis, conversion, and management within ArcGIS (Toms, 2015). After generating state-scale LES (large extent spatial) datasets in GeoTIFF format, we utilized the xarray and rioxarray Python packages to convert GeoTIFF to NetCDF. Xarray is a Python package for working with multi-dimensional arrays, and rioxarray is the rasterio extension for xarray; rasterio is a Python library for reading and writing GeoTIFF and other raster formats. Xarray facilitated data manipulation and metadata addition in the NetCDF file, while rioxarray was used to save the GeoTIFF data as NetCDF. These procedures resulted in the creation of three HydroShare resources (HS 3, HS 4 and HS 5) for sharing state-scale LES datasets. Notably, due to licensing constraints with ArcGIS Pro, a commercial GIS software, the Jupyter notebook development was undertaken on a Windows OS.
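
    A hedged sketch of the GeoTIFF-to-NetCDF conversion described above (file paths, attribute values, and tile layout are placeholders, not the resource's actual inputs):

    ```python
    import rioxarray
    from rioxarray.merge import merge_arrays

    # Hypothetical input tiles for one state; replace with the actual GeoTIFF paths.
    tile_paths = ["tile_north.tif", "tile_south.tif"]
    tiles = [rioxarray.open_rasterio(path, masked=True) for path in tile_paths]

    # Merge the tiles into a single state-scale raster.
    state_raster = merge_arrays(tiles)

    # Add descriptive metadata before export.
    state_raster = state_raster.assign_attrs(
        title="State-scale LES dataset (example)",
        source="Derived from GeoTIFF tiles via ArcPy preprocessing",
    )

    # Save the result as NetCDF; the CRS and transform are carried along by rioxarray.
    state_raster.to_netcdf("state_les_dataset.nc")
    ```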

