43 datasets found
  1. Data Pre-Processing : Data Integration

    • kaggle.com
    Updated Aug 2, 2022
    Cite
    Mr.Machine (2022). Data Pre-Processing : Data Integration [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-preprocessing-data-integration
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 2, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mr.Machine
    Description

    In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv. The student.csv dataset contains columns such as Age, Gender, Grade, and Employed, while the marks.csv dataset contains columns such as Mark and City. The Student_id column is common between the two datasets. Follow these steps to complete this exercise.
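
    As a rough sketch of the merge described above (assuming both files sit in the working directory and share the Student_id column exactly as named), the exercise boils down to a single pandas join:

    import pandas as pd

    # Load both source files
    students = pd.read_csv("student.csv")
    marks = pd.read_csv("marks.csv")

    # Join on the shared Student_id column
    combined = students.merge(marks, on="Student_id", how="inner")
    print(combined.head())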

  2. imagenet2012_subset

    • tensorflow.org
    Updated Oct 21, 2024
    + more versions
    Cite
    (2024). imagenet2012_subset [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012_subset
    Explore at:
    Dataset updated
    Oct 21, 2024
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. Upon its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split, export those results to a text file, and upload that file to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to submit up to 2 submissions per week in order to prevent overfitting.

    To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file contains 100,000 lines, one for each image in the test split. Each line holds the rank-ordered top-5 predictions for that test image, as integers that are 1-indexed against the line numbers of the corresponding labels file (see labels.txt).
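
    As an illustration of that format (not part of the dataset), the sketch below writes a submission file from an array of per-class scores; the array itself, its size and the 0-based-to-1-indexed shift are assumptions:

    import numpy as np

    # Placeholder scores; the real split has 100,000 rows, one per test image
    probs = np.random.rand(100, 1000)

    # Rank-ordered top-5 class indices per image, highest score first
    top5 = np.argsort(-probs, axis=1)[:, :5]

    # Shift 0-based indices to the 1-indexed lines of the labels file
    with open("submission.txt", "w") as f:
        for row in top5 + 1:
            f.write(" ".join(str(c) for c in row) + "\n")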

    To use this dataset:

    import tensorflow_datasets as tfds
    
    # Load the training split and preview a few examples
    ds = tfds.load('imagenet2012_subset', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012_subset-1pct-5.0.0.png

  3. Combined wildfire datasets for the United States and certain territories,...

    • s.cnmilf.com
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Combined wildfire datasets for the United States and certain territories, 1800s-Present (combined wildland fire polygons) [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/combined-wildfire-datasets-for-the-united-states-and-certain-territories-1800s-present-com
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    United States
    Description

    First, we would like to thank the wildland fire advisory group. Their wisdom and guidance helped us build the dataset as it currently exists. Currently, there are multiple, freely available fire datasets that identify wildfire and prescribed fire burned areas across the United States. However, these datasets are all limited in some way. Their time periods may cover only a couple of decades, or they may have stopped collecting data many years ago. Their spatial footprints may be limited to a specific geographic area or agency. Their attribute data may be limited to nothing more than a polygon and a year. None of the existing datasets provides a comprehensive picture of fires that have burned throughout the last few centuries.

    Our dataset uses these existing layers and a series of both manual processes and ArcGIS Python (arcpy) scripts to merge them into a single dataset that encompasses the known wildfires and prescribed fires within the United States and certain territories. Forty different fire layers were utilized in this dataset. First, these datasets were ranked by order of observed quality (Tiers). The datasets were given a common set of attribute fields, and as many of these fields as possible were populated within each dataset. All fire layers were then merged together by their common attributes to create a merged dataset containing all fire polygons. Polygons were then processed in order of Tier (1-8) so that overlapping polygons in the same year and Tier were dissolved together. Overlapping polygons in subsequent Tiers were removed from the dataset. Attributes from the original datasets of all intersecting polygons in the same year across all Tiers were also merged, so that all attributes from all Tiers were included, but only the polygons from the highest-ranking Tier were dissolved to form the fire polygon. The resulting product (the combined dataset) has only one fire per year in a given area, with one set of attributes.

    While it combines wildfire data from 40 wildfire layers and therefore has more complete information on wildfires than the datasets that went into it, this dataset also has its own set of limitations. Please see the Data Quality attributes within the metadata record for additional information on this dataset's limitations. Overall, we believe this dataset is designed to be a comprehensive collection of fire boundaries within the United States and provides a more thorough and complete picture of fires across the United States when compared to the datasets that went into it.
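
    The actual workflow used manual steps and ArcGIS Python (arcpy) scripts that are not included here; purely as an illustration of the merge-and-dissolve pattern described above, a geopandas sketch might look like the following (the layer paths and the FireYear/Tier field names are assumptions, and the cross-Tier overlap removal is omitted):

    import pandas as pd
    import geopandas as gpd

    # Hypothetical input layers, each already tagged with a Tier rank and a fire year
    layers = [gpd.read_file(p) for p in ["tier1_fires.shp", "tier2_fires.shp"]]

    # Merge all layers into one dataset with a common schema
    fires = pd.concat(layers, ignore_index=True)

    # Dissolve overlapping polygons that share the same year and Tier
    dissolved = fires.dissolve(by=["FireYear", "Tier"], as_index=False)
    dissolved.to_file("merged_fires.gpkg", driver="GPKG")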

  4. Connecticut State Parcel Layer 2023

    • catalog.data.gov
    • data.ct.gov
    • +2more
    Updated May 10, 2025
    + more versions
    Cite
    State of Connecticut (2025). Connecticut State Parcel Layer 2023 [Dataset]. https://catalog.data.gov/dataset/connecticut-state-parcel-layer-2023-74a65
    Explore at:
    Dataset updated
    May 10, 2025
    Dataset provided by
    State of Connecticut
    Area covered
    Connecticut
    Description

    The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2023 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor's database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 12/08/2023 from data collected in 2022-2023. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.

    CAMA Notes:
    • The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.
    • CAMA was provided by the towns.
    • Canaan parcels are viewable, but no additional information is available since no CAMA data was submitted.

    Spatial Data Notes:
    • Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,247,506 parcels.
    • No alteration has been made to the spatial geometry of the data.
    • The data fields carrying CAMA information were sourced from the towns' CAMA data.
    • If a town provided no field for linking the parcels back to the CAMA, a field from the original data was selected instead, provided it joined back to the CAMA with a match rate above 50%.
    • Linking fields were renamed to "Link".
    • All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.
    • Only the fields for town name, Location, Editor, Edit Date, and the link fields associated with the towns' CAMA were used in the creation of this dataset; any other field provided in the original data was deleted or not used.
    • Field names for town (Muni, Municipality) were renamed to "Town Name".
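
    A minimal sketch of the linking logic described in the notes above, i.e. prefixing the link value with a census town code and joining CAMA attributes onto parcels (the file names, the example town code and the exact column names are assumptions):

    import pandas as pd

    # Hypothetical per-town inputs: one parcel table and one CAMA table
    parcels = pd.read_csv("parcels_sometown.csv")
    cama = pd.read_csv("cama_sometown.csv")

    town_code = "097"  # hypothetical census town code

    # Prefix the linking value with the town code to make it unique statewide
    parcels["Link"] = town_code + parcels["Link"].astype(str)
    cama["Link"] = town_code + cama["Link"].astype(str)

    # Join CAMA attributes onto the parcel records
    joined = parcels.merge(cama, on="Link", how="left")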

  5. CLM AWRA HRVs Uncertainty Analysis

    • data.gov.au
    • researchdata.edu.au
    • +1more
    Updated Nov 19, 2019
    Cite
    Bioregional Assessment Program (2019). CLM AWRA HRVs Uncertainty Analysis [Dataset]. https://data.gov.au/data/dataset/e51a513d-fde7-44ba-830c-07563a7b2402
    Explore at:
    Dataset updated
    Nov 19, 2019
    Dataset provided by
    Bioregional Assessment Program
    Description

    Abstract

    This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    This dataset contains the data and scripts to generate the hydrological response variables for surface water in the Clarence Moreton subregion as reported in CLM261 (Gilfedder et al. 2016).

    Dataset History

    File CLM_AWRA_HRVs_flowchart.png shows the different files in this dataset and how they interact. The Python and R scripts, written by the BA modelling team, read, combine and analyse the source datasets CLM AWRA model, CLM groundwater model V1 and CLM16swg Surface water gauging station data within the Clarence Moreton Basin (as detailed below) to create the hydrological response variables for surface water as reported in CLM2.6.1 (Gilfedder et al. 2016).

    R-script HRV_SWGW_CLM.R reads, for each model simulation, the outputs from the surface water model in netcdf format from file Qtot.nc (dataset CLM AWRA model) and the outputs from the groundwater model, flux_change.csv (dataset CLM groundwater model V1) and creates a set of files in subfolder /Output for each GaugeNr and simulation Year:

    CLM_GaugeNr_Year_all.csv and CLM_GaugeNR_Year_baseline.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for baseline conditions

    CLM_GaugeNr_Year_CRDP.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for CRDP conditions (=AWRA streamflow - MODFLOW change in SW-GW flux)

    CLM_GaugeNr_Year_minMax.csv: minimum and maximum of HRVs over all 5000 simulations

    Python script CLM_collate_DoE_Predictions.py collates that information into the following files, for each HRV and each maxtype (absolute maximum (amax), relative maximum (pmax) and time of absolute maximum change (tmax)):

    CLM_AWRA_HRV_maxtyp_DoE_Predictions: for each simulation and each gauge_nr, the maxtyp of the HRV over the prediction period (2012 to 2102)

    CLM_AWRA_HRV_DoE_Observations: for each simulation and each gauge_nr, the HRV for the years that observations are available

    CLM_AWRA_HRV_Observations: summary statistics of each HRV and the observed value (based on data set CLM16swg Surface water gauging station data within the Clarence Moreton Basin)

    CLM_AWRA_HRV_maxtyp_Predictions: summary statistics of each HRV

    R-script CLM_CreateObjectiveFunction.R calculates for each HRV the objective function value for all simulations and stores it in CLM_AWRA_HRV_ss.csv. This file is used by python script CLM_AWRA_SI.py to generate figure CLM-2615-002-SI.png (sensitivity indices).

    The AWRA objective function is combined with the overall objective function from the groundwater model in dataset CLM Modflow Uncertainty Analysis (CLM_MF_DoE_ObjFun.csv) into csv file CLM_AWRA_HRV_oo.csv. This file is used to select behavioural simulations in python script CLM-2615-001-top10.py. This script uses files CLM_NodeOrder.csv and BA_Visualisation.py to create the figures CLM-2616-001-HRV_10pct.png.

    Dataset Citation

    Bioregional Assessment Programme (2016) CLM AWRA HRVs Uncertainty Analysis. Bioregional Assessment Derived Dataset. Viewed 28 September 2017, http://data.bioregionalassessments.gov.au/dataset/e51a513d-fde7-44ba-830c-07563a7b2402.

    Dataset Ancestors

  6. Hydrologic Unit (HUC8) Boundaries for Alaskan Watersheds

    • dataone.org
    • knb.ecoinformatics.org
    • +1more
    Updated Mar 20, 2019
    Cite
    Jared Kibele (2019). Hydrologic Unit (HUC8) Boundaries for Alaskan Watersheds [Dataset]. http://doi.org/10.5063/F14T6GM3
    Explore at:
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Jared Kibele
    Time period covered
    Jan 19, 2018
    Area covered
    Variables measured
    name, source, id_numeric, id_original
    Description

    The United States is divided and sub-divided into successively smaller hydrologic units which are classified into four levels: regions, sub-regions, accounting units, and cataloging units. The hydrologic units are arranged or nested within each other, from the largest geographic area (regions) to the smallest geographic area (cataloging units). Each hydrologic unit is identified by a unique hydrologic unit code (HUC) consisting of two to eight digits based on the four levels of classification in the hydrologic unit system. A shapefile (or geodatabase) of watersheds for the state of Alaska and parts of western Canada was created by merging two datasets: the U.S. Watershed Boundary Dataset (WBD) and the Government of Canada's National Hydro Network (NHN). Since many rivers in Alaska are transboundary, the NHN data is necessary to capture their watersheds. The WBD data can be found at https://catalog.data.gov/dataset/usgs-national-watershed-boundary-dataset-wbd-downloadable-data-collection-national-geospatial- and the NHN data can be found here: https://open.canada.ca/data/en/dataset/a4b190fe-e090-4e6d-881e-b87956c07977. The included python script was used to subset and merge the two datasets into the single dataset, archived here.
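
    The Python script itself is not reproduced in this listing; a minimal geopandas sketch of the subset-and-merge step it describes could look like the following (the local file names are assumptions, and harmonizing the two attribute schemas is omitted):

    import pandas as pd
    import geopandas as gpd

    # Hypothetical local copies of the two source datasets
    wbd = gpd.read_file("WBD_HU8_Alaska.shp")
    nhn = gpd.read_file("NHN_workunits.shp")

    # Reproject the Canadian data to match the WBD coordinate system
    nhn = nhn.to_crs(wbd.crs)

    # Tag each record with its source, then combine into a single layer
    wbd["source"] = "WBD"
    nhn["source"] = "NHN"
    merged = pd.concat([wbd, nhn], ignore_index=True)
    merged.to_file("ak_huc8_watersheds.shp")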

  7. Data from: Enhancing Open Modification Searches via a Combined Approach...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 2, 2020
    Cite
    Mechthild Pohlschröder (2020). Enhancing Open Modification Searches via a Combined Approach Facilitated by Ursgal [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4299357
    Explore at:
    Dataset updated
    Dec 2, 2020
    Dataset provided by
    Stefan Schulze
    Christian Fufezan
    Sebastian A. Leidel
    Aime Bienfait Igiraneza
    Manuel Kösters
    Mechthild Pohlschröder
    Johannes Leufken
    Benjamin A. Garcia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The identification of peptide sequences and their post-translational modifications (PTMs) is a crucial step in the analysis of bottom-up proteomics data. The recent development of open modification search (OMS) engines allows virtually all PTMs to be searched for. This not only increases the number of spectra that can be matched to peptides but also greatly advances the understanding of biological roles of PTMs through the identification, and thereby facilitated quantification, of peptidoforms (peptide sequences and their potential PTMs). While the benefits of combining results from multiple protein database search engines have been established previously, similar approaches for OMS results are missing so far. Here, we compare and combine results from three different OMS engines, demonstrating an increase in peptide spectrum matches of 8-18%. The unification of search results furthermore allows for the combined downstream processing of search results, including the mapping to potential PTMs. Finally, we test for the ability of OMS engines to identify glycosylated peptides. The implementation of these engines in the Python framework Ursgal facilitates the straightforward application of OMS with unified parameters and results files, thereby enabling yet unmatched high-throughput, large-scale data analysis.

    This dataset includes all relevant results files, databases, and scripts that correspond to the accompanying journal article. Specifically, the following files are deposited:

    Homo_sapiens_PXD004452_results.zip: result files from OMS and CS for the dataset PXD004452

    Homo_sapiens_PXD013715_results.zip: result files from OMS and CS for the dataset PXD013715

    Haloferax_volcanii_PXD021874_results.zip: result files from OMS and CS for the dataset PXD021874

    Escherichia_coli_PXD000498_results.zip: result files from OMS and CS for the dataset PXD000498

    databases.zip: target-decoy databases for Homo sapiens, Escherichia coli and Haloferax volcanii as well as a glycan database for Homo sapiens

    scripts.zip: example scripts for all relevant steps of the analysis

    mzml_files.zip: mzML files for all included datasets

    ursgal.zip: current version of Ursgal (0.6.7) that has been used to generate the results (for most recent versions see https://github.com/ursgal/ursgal)

  8. Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation...

    • b2find.eudat.eu
    Updated Jul 31, 2025
    Cite
    (2025). Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3524622d-2099-554c-826a-f2155c3f4bb4
    Explore at:
    Dataset updated
    Jul 31, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Key Characteristics of Algorithms' Dynamics Beyond Accuracy - Evaluation Tests (v2), conducted for the paper: "What do anomaly scores actually mean? Key characteristics of algorithms' dynamics beyond accuracy" by F. Iglesias, H. O. Marques, A. Zimek, T. Zseby.

    Context and methodology

    Anomaly detection is intrinsic to a large number of data analysis applications today. Most of the algorithms used assign an outlierness score to each instance prior to establishing anomalies in a binary form. The experiments in this repository study how different algorithms generate different dynamics in the outlierness scores and react in very different ways to possible model perturbations that affect data. The study elaborated in the referred paper presents new indices and coefficients to assess the dynamics and explores the responses of the algorithms as a function of variations in these indices, revealing key aspects of the interdependence between algorithms, data geometries and the ability to discriminate anomalies. Therefore, this repository reproduces the conducted experiments, which study eight algorithms (ABOD, HBOS, iForest, K-NN, LOF, OCSVM, SDO and GLOSH) submitted to seven perturbations related to: cardinality, dimensionality, outlier proportion, inlier-outlier density ratio, density layers, clusters and local outliers. It collects behavioural profiles with eleven measurements (Adjusted Average Precision, ROC-AUC, Perini's Confidence [1], Perini's Stability [2], S-curves, Discriminant Power, Robust Coefficients of Variation for Inliers and Outliers, Coherence, Bias and Robustness) under two types of normalization: linear and Gaussian, the latter aiming to standardize the outlierness scores issued by different algorithms [3].

    This repository is framed within research on the following domains: algorithm evaluation, outlier detection, anomaly detection, unsupervised learning, machine learning, data mining, data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    References

    [1] Perini, L., Vercruyssen, V., Davis, J.: Quantifying the confidence of anomaly detectors in their example-wise predictions. In: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer Verlag (2020).

    [2] Perini, L., Galvin, C., Vercruyssen, V.: A Ranking Stability Measure for Quantifying the Robustness of Anomaly Detection Methods. In: 2nd Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning @ ECML/PKDD (2020).

    [3] Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pp. 13-24 (2011).

    Technical details

    Experiments were tested with Python 3.9.6. The provided scripts generate all synthetic data and results; we keep them in the repository for the sake of comparability and replicability ("outputs.zip" file). The file and folder structure is as follows:

    • "compare_scores_group.py" is a Python script to extract the new dynamic indices proposed in the paper.
    • "generate_data.py" is a Python script to generate the datasets used for evaluation.
    • "latex_table.py" is a Python script to show results in a latex-table format.
    • "merge_indices.py" is a Python script to merge accuracy and dynamic indices in the same table-structured summary.
    • "metric_corr.py" is a Python script to calculate correlation estimations between indices.
    • "outdet.py" is a Python script that runs outlier detection with different algorithms on diverse datasets.
    • "perini_tests.py" is a Python script to run Perini's confidence and stability on all datasets and algorithms' performances.
    • "scatterplots.py" is a Python script that generates scatter plots for comparing accuracy and dynamic performances.
    • "README.md" provides explanations and step-by-step instructions for replication.
    • "requirements.txt" contains references to required Python libraries and versions.
    • "outputs.zip" contains all result tables, plots and synthetic data generated with the scripts.
    • [data/real_data] contains CSV versions of the Wilt, Shuttle, Waveform and Cardiotocography datasets (inherited and adapted from the LMU repository).

    License

    The CC-BY license applies to all data generated with the "generated_data.py" script. All distributed code is under the GNU GPL license. For the "ExCeeD.py" and "stability.py" scripts, please consult and refer to the original sources provided above.

  9. Hydrologic Unit (HUC8) Boundaries for Alaskan Watersheds

    • knb.ecoinformatics.org
    Updated Jan 4, 2019
    Cite
    Jared Kibele (2019). Hydrologic Unit (HUC8) Boundaries for Alaskan Watersheds [Dataset]. http://doi.org/10.5063/F1M043MV
    Explore at:
    Dataset updated
    Jan 4, 2019
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Jared Kibele
    Time period covered
    Jan 19, 2018
    Area covered
    Variables measured
    name, source, id_numeric, id_original
    Description

    The United States is divided and sub-divided into successively smaller hydrologic units which are classified into four levels: regions, sub-regions, accounting units, and cataloging units. The hydrologic units are arranged or nested within each other, from the largest geographic area (regions) to the smallest geographic area (cataloging units). Each hydrologic unit is identified by a unique hydrologic unit code (HUC) consisting of two to eight digits based on the four levels of classification in the hydrologic unit system. A shapefile (or geodatabase) of watersheds for the state of Alaska and parts of western Canada was created by merging two datasets: the U.S. Watershed Boundary Dataset (WBD) and the Government of Canada's National Hydro Network (NHN). Since many rivers in Alaska are transboundary, the NHN data is necessary to capture their watersheds. The WBD data can be found at https://catalog.data.gov/dataset/usgs-national-watershed-boundary-dataset-wbd-downloadable-data-collection-national-geospatial- and the NHN data can be found here: https://open.canada.ca/data/en/dataset/a4b190fe-e090-4e6d-881e-b87956c07977. The included python script was used to subset and merge the two datasets into the single dataset, archived here.

  10. Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava...

    • portaldelainvestigacion.uma.es
    Updated 2025
    Cite
    Domínguez García-Escudero, Raúl; Germa, Thierry; Pérez del Pulgar Mancebo, Carlos Jesús (2025). Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave - Datasets [Dataset]. https://portaldelainvestigacion.uma.es/documentos/67a9c7ce19544708f8c7316e
    Explore at:
    Dataset updated
    2025
    Authors
    Domínguez García-Escudero, Raúl; Germa, Thierry; Pérez del Pulgar Mancebo, Carlos Jesús
    Description

    The dataset contains the logs used to produce the results described in the publication "Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave", Raúl Domínguez et al., 2025.

    Cooperative Surface Exploration

    • CoRob_MP1_results.xlsx: Includes the log produced at the commanding station during the Mission Phase 1. It has been used to produce the results evaluation of the MP1.

    • cmap.ply: Resulting map of the MP1.

    • ground_truth_transformed_and_downsampled.ply: Ground truth map used for the evaluation of the cooperative map accuracy.

    Ground Truth Rover Logs

    The dataset contains the samples used to generate the map provided as ground truth for the cave in the publication "Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave", Raúl Domínguez et al., 2025.

    The dataset has three parts. Between each of the parts, the data capture had to be interrupted, and after each interruption the position of the rover was not exactly the same as before. For that reason, it has been quite challenging to generate a full reconstruction using the three parts one after the other. Since it was not possible to combine the different parts in a single SLAM reconstruction, the last of the logs was not filtered or even pre-processed.

    Each log contains:
    • depthmaps: the raw LiDAR data from the Velodyne 32. Format: tiff.
    • filtered_cloud: the pre-processed LiDAR data from the Velodyne 32. Format: ply.
    • joint_states: the motor position values. Unfortunately the back axis passive joint is not included. Format: json.
    • orientation_samples: the orientation as provided by the IMU sensor. Format: json.

    • asguard_v4.urdf: In addition to the datasets, a geometrical robot model is provided which might be needed for environment reconstruction and pose estimation algorithms. Format: URDF.

    Folders contents

    ├── 20211117-1112
    │   ├── depth
    │   │   └── depth_1637143958347198
    │   ├── filtered_cloud
    │   │   └── cloud_1637143958347198
    │   ├── joint_states
    │   │   └── joints_state_1637143957824829
    │   └── orientation_samples
    │       └── orientation_sample_1637143958005814
    ├── 20211117-1140
    │   ├── depth
    │   │   └── depth_1637145649108790
    │   ├── filtered_cloud
    │   │   └── cloud_1637145649108790
    │   ├── joint_states
    │   │   └── joints_state_1637145648630977
    │   └── orientation_samples
    │       └── orientation_sample_1637145648831795
    └── 20211117-1205
        ├── depth
        │   └── depth_1637147164030135
        ├── filtered_cloud
        │   └── cloud_1637147164330388
        ├── joint_states
        │   └── joints_state_1637147163501574
        └── orientation_samples
            └── orientation_sample_1637147163655187

    Cave reconstruction

    • first_log_2cm_res_pointcloud-20231222.ply, contains the integrated pointcloud produced from the first of the logs.

    Coyote 3 Logs

    The msgpack datasets can be imported using Python with the pocolog2msgpack library; a rough loading sketch is shown below.
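
    Purely as a generic illustration (the exact export schema produced by pocolog2msgpack is not described here, and the file name is an assumption), a msgpack export can be inspected like this:

    import msgpack

    # Hypothetical exported file; the real exports ship as *_msgpacks.tar.gz archives
    with open("coyote3_odometry.msg", "rb") as f:
        data = msgpack.unpack(f, raw=False)

    # Print the top-level keys/streams contained in the export
    if isinstance(data, dict):
        for key in data:
            print(key)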

    The geometrical rover model of Coyote 3 is included in URDF format. It can be used in environment reconstruction algorithms which require the positions of the different sensors.

    MP3

    Includes exports of the log files used to compute the KPIs of the MP3.

    MP4

    These logs were used to obtain the KPI values for the MP4. It is composed of the following archives:
    • log_coyote_02-03-2023_13-22_01-exp3.zip
    • log_coyote_02-03-2023_13-22_01-exp4.zip
    • log_coyote_02-09-2023_19-14_18_demo_skylight.zip
    • log_coyote_02-09-2023_19-14_20_demo_teleop.zip
    • coyote3_odometry_20230209-154158.0003_msgpacks.tar.gz
    • coyote3_odometry_20230203-125251.0819_msgpacks.tar.gz

    Cave PLYs

    Two integrated pointclouds and one trajectory produced from logs captured by Coyote 3 inside the cave:
    • Skylight_subsampled_mesh.ply
    • teleop_tunnel_pointcloud.ply
    • traj.ply

    Example scripts to load the datasets

    The repository https://github.com/Rauldg/corobx_dataset_scripts contains some example scripts which load some of the datasets.

  11. Features of probabilistic linkage solutions available for record linkage...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Features of probabilistic linkage solutions available for record linkage applications. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Features of probabilistic linkage solutions available for record linkage applications.

  12. Datasets and software for article "Predicting coarse-grained representations...

    • zenodo.org
    application/gzip, zip
    Updated Jan 30, 2025
    Cite
    Arnaud Belcour; Hidde de Jong; Delphine Ropers (2025). Datasets and software for article "Predicting coarse-grained representations of biogeochemical cycles from metabarcoding data" [Dataset]. http://doi.org/10.5281/zenodo.14762347
    Explore at:
    application/gzip, zip (available download formats)
    Dataset updated
    Jan 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Arnaud Belcour; Hidde de Jong; Delphine Ropers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets for Tabigecy article

    This repository contains additional information for the article "Predicting coarse-grained representations of biogeochemical cycles from metabarcoding data". This information allows the results presented in the article to be reproduced:

    - article_data: this zip archive contains:

    • input files used for the article experiments:
      • bordenave_et_al_2013.tsv, bordenave_et_al_2013_abundance.csv and bordenave_et_al_2013_group.tsv for the Bordenave et al. dataset.
      • schwab_et_al_2022.tsv, schwab_et_al_2022_abundance.tsv and schwab_et_al_2022_sample_grouping.tsv for the Schwab et al. dataset.
    • output folders from Tabigecy run on these inputs:
      • output_bordenave: output folder for Bordenave et al. dataset.
      • output_schwab: output folder for Schwab et al. dataset.
    • scripts used to create plots:
      • create_pca.R to create PCA biplot and correlation plot from the results of both datasets.
      • bordenave_create_figure_article.py to create polar plots for the Bordenave et al. dataset.
      • schwab_create_figure_article.py to create polar plots for the Schwab et al. dataset.
    • several folders containing svg files for the article figures: bordenave_figure, experiment_figure, schwab_figure and workflow_figure.
    • original_data: original input files for both datasets.

    - input_files_esmecata_precomputed_db.zip: six input files created by SPARQL queries on UniProt to extract all taxa associated with species, genus, family, order, class and phylum. They were created using a script available in EsMeCaTa repository: esmecata/precomputed/create_input_precomputation.py. These files were used as input to esmecata proteomes to create the precomputed database.

    - database_proteomes_folder.zip: compressed archive containing the proteomes retrieved by EsMeCaTa for species, genus, family, order, class and phylum to create the EsMeCaTa precomputed database version 1.0.0. It is the result of combining the different runs of esmecata proteomes on the 6 taxonomic ranks for all the associated taxa of UniProt. From this folder, the precomputed database has been created by means of the following commands:

    Clustering of the proteomes:

    esmecata clustering -i database_proteomes_folder -o database_output_clustering -c 32 --remove-tmp

    Annotation of the consensus proteomes:

    esmecata annotation -i database_output_clustering -o database_output_annotation -e /path/to/eggnog/database -c 32

    Merging results from the three folders into the precomputed database:

    esmecata_create_db from_runs -iproteomes database_proteomes_folder -iclustering database_output_clustering -iannotation database_output_annotation -o esmecata_precomputed_database --db-version "1.0" -c 10

    - software.zip: compressed archive containing the code of the tools developed and used in the article:

    • bigecyhmm-0.1.5.zip: contains the code of bigecyhmm version 0.1.5 used in the article.
    • esmecata-0.6.0.zip: contains the code of esmecata version 0.6.0 used in the article.
    • tabigecy-0.1.1.zip: contains the code of tabigecy version 0.1.1 used in the article.

    - taxdmp_2024-10-01.tar.gz: The version of the NCBI Taxonomy database used in the article. To use this version of the database with EsMeCaTa, you have to import it with ete3 using the following command:

    python3 -c "from ete3 import NCBITaxa; ncbi = NCBITaxa(); ncbi.update_taxonomy_database('taxdmp_2024-10-01.tar.gz')"

    Perform experiments

    Analyses performed in the article can be reproduced by running tabigecy on the input files of the article_data archive.

    To do so, add the EsMeCaTa input file (either bordenave_et_al_2013.tsv or schwab_et_al_2022.tsv) to the parameter --infile and the abundance file (either bordenave_et_al_2013_abundance.csv or schwab_et_al_2022_abundance.tsv) to the parameter --inAbundfile. The precomputed database is required and can be given with the parameter --precomputedDB. The database can be downloaded from Zenodo.

    Commands for the Bordenave et al. dataset:

    nextflow run ArnaudBelcour/tabigecy --infile bordenave_et_al_2013.tsv --inAbundfile bordenave_et_al_2013_abundance.csv --precomputedDB /path/to/esmecata_database.zip --outputFolder output_bordenave --coreBigecyhmm xx

    Commands for the Schwab et al. dataset:

    nextflow run ArnaudBelcour/tabigecy --infile schwab_et_al_2022.tsv --inAbundfile schwab_et_al_2022_abundance.tsv --precomputedDB /path/to/esmecata_database.zip --outputFolder output_schwab --coreBigecyhmm xx

    To decrease the runtime of the workflow, it is advised to give several cores to `--coreBigecyhmm xx`. With 5 cores, the runtime of the workflow is around 13 minutes.

    To create polar plots, call the two Python scripts at the same location where the input files and output folder are (inside article_data folder):

    python3 bordenave_create_figure_article.py

    python3 schwab_create_figure_article.py

    To create the PCA and correlation plots, launch the R script on the same location:

    Rscript create_pca.R

    Metadata

    The experiments were performed with the following tool versions:

    • Java (OpenJDK): 11.0.22
    • Nextflow: 24.10.3
    • Tabigecy: 0.1.1
    • Python: 3.12.2
    • EsMeCaTa: 0.6.0
    • EsMeCaTa precomputed database: 1.0.0
    • ete3: 3.1.3
    • biopython: 1.83
    • bigecyhmm: 0.1.5
    • pandas: 1.5.3
    • plotly: 5.19.0
    • matplotlib: 3.9.2
    • seaborn: 0.13.2
    • kaleido: 0.2.1
    • pyhmmer: 0.10.8
    • pillow: 10.1.0
    • R: 4.4.1
    • factoextra: 1.0.7
    • ade4: 1.7-22
    • corrplot: 0.94

  13. Data from: Mining Rule Violations in JavaScript Code Snippets

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jan 24, 2020
    Cite
    Bonifácio, Rodrigo (2020). Mining Rule Violations in JavaScript Code Snippets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2593817
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Moraes, João Pedro
    Ferreira Campos, Uriel
    Smethurst, Guilherme
    Pinto, Gustavo
    Bonifácio, Rodrigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Content of this repository

    This is the repository that contains the scripts and dataset for the MSR 2019 mining challenge.

    GitHub repository with the software used: here.

    DATASET

    The dataset was retrieved using Google BigQuery and dumped to a CSV file for further processing. This original, untreated file is called jsanswers.csv, and it contains the following information:
    1. The Id of the question (PostId)
    2. The Content (in this case the code block)
    3. The length of the code block
    4. The line count of the code block
    5. The score of the post
    6. The title

    A quick look at these files shows that a PostID can have multiple rows related to it; that is how multiple code blocks are saved in the database.

    Filtered Dataset:

    Extracting code from CSV

    We used a Python script called "ExtractCodeFromCSV.py" to extract the code from the original CSV and merge all the code blocks into their respective JavaScript files, named after the PostID; this resulted in 336 thousand files. A minimal sketch of this step is shown below.
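
    A minimal sketch of that extraction step, assuming the PostId and Content column names listed above (the real column names in jsanswers.csv may differ):

    import pandas as pd
    from pathlib import Path

    df = pd.read_csv("jsanswers.csv")
    out = Path("js_files")
    out.mkdir(exist_ok=True)

    # Concatenate all code blocks belonging to the same question into one .js file
    for post_id, group in df.groupby("PostId"):
        code = "\n\n".join(group["Content"].astype(str))
        (out / f"{post_id}.js").write_text(code, encoding="utf-8")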

    Running ESlint

    Due to the single-threaded nature of ESLint, running it on 336 thousand files took a huge toll on the machine, so we created a script to run it in parallel. This script is named "ESlintRunnerScript.py"; it splits the files into 20 evenly distributed parts and runs 20 ESLint processes to generate the reports, producing 20 JSON files. A rough sketch of this approach is shown below.
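
    A rough sketch of that chunk-and-run approach (the repository's script is not reproduced here; the eslint invocation and the report file names are assumptions):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    files = sorted(Path("js_files").glob("*.js"))
    n_parts = 20
    chunks = [files[i::n_parts] for i in range(n_parts)]

    def run_eslint(idx, chunk):
        # One JSON report per chunk, mirroring the 20 reports described above
        cmd = ["npx", "eslint", "--format", "json", "--output-file", f"report_{idx}.json"]
        subprocess.run(cmd + [str(p) for p in chunk], check=False)

    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        for idx, chunk in enumerate(chunks):
            pool.submit(run_eslint, idx, chunk)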

    Number of Violations per Rule

    This information was extracted using the script named "parser.py". It generated the file "NumberofViolationsPerRule.csv", which contains the number of violations per rule used in the linter configuration of the dataset.

    Number of Violations per Category

    As a way to produce relevant statistics on the dataset, we generated the number of violations per rule category as defined on the ESLint website; this information was extracted using the same "parser.py" script.

    Individual Reports

    This information was extracted from the JSON reports; it is a CSV file with PostID and violations per rule.

    Rules

    The file "Rules with categories" contains all the rules used and their categories.

  14. Replication Data for Exploring an extinct society through the lens of...

    • dataone.org
    Updated Dec 16, 2023
    Cite
    Wieczorek, Oliver; Malzahn, Melanie (2023). Replication Data for Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus [Dataset]. http://doi.org/10.7910/DVN/UF8DHK
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wieczorek, Oliver; Malzahn, Melanie
    Description

    The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". This study aimed at utilizing the CEToM corpus (https://cetom.univie.ac.at/) (Tocharian) to analyze the life-world of the elites of an extinct society situated in modern eastern China. To acquire the raw data needed for steps 1 & 2, please contact Melanie Malzahn (melanie.malzahn@univie.ac.at). We conducted a mixed-methods study consisting of close reading, content analysis, and multiple correspondence analysis (MCA). The Excel file titled "fragments_architecture_combined.xlsx" allows for replication of the MCA and equates to the third step of the workflow outlined below.

    We used the following programming languages and packages to prepare the dataset and to analyze the data. Data preparation and merging procedures were achieved in Python (version 3.9.10) with the packages pandas (version 1.5.3), os (version 3.12.0), re (version 3.12.0), numpy (version 1.24.3), gensim (version 4.3.1), BeautifulSoup4 (version 4.12.2), pyasn1 (version 0.4.8), and langdetect (version 1.0.9). Multiple correspondence analyses were conducted in R (version 4.3.2) with the packages FactoMineR (version 2.9), factoextra (version 1.0.7), readxl (version 1.4.3), tidyverse (version 2.0.0), ggplot2 (version 3.4.4) and psych (version 2.3.9).

    After requesting the necessary files, please open the scripts in the order outlined below and execute the code files to replicate the analysis:

    Preparatory step: Create a folder for the Python and R scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script will generate the following folders:
    • "tarim-brahmi_database": folder which contains Tocharian dictionaries and Tocharian text fragments.
    • "dictionaries": contains Tocharian A and Tocharian B vocabularies, including linguistic features such as translations, meanings, part-of-speech tags etc. A full overview of the words is provided at https://cetom.univie.ac.at/?words.
    • "fragments": contains Tocharian text fragments as xml-files.
    • "word_corpus_data": folder that will contain Excel files of the corpus data after the first step.
    • "Architectural_terms": contains the data on the architectural terms used in the dataset (e.g. dwelling, house).
    • "regional_data": contains the data on the findspots (Tocharian and modern Chinese equivalent, e.g. Duldur-Akhur & Kucha).
    • "mca_ready_data": the folder in which the Excel file with the merged data will be saved. Note that the prepared file named "fragments_architecture_combined.xlsx" can be saved into this directory. This allows you to skip steps 1 & 2 and reproduce the MCA of the content analysis based on the third step of our workflow (R script 3_conduct_MCA.R).

    First step - run 1_read_xml-files.py: loops over the xml-files in the dictionaries folder and identifies word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. Then it loops over the xml text files and extracts a text id number, language (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, material on which the text is written, medium, findspot, the source text in Tocharian, and the translation where available. After successful feature extraction, the resulting pandas dataframe object is exported to the word_corpus_data folder.

    Second step - run 2_merge_excel_files.py: merges all Excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was based upon close reading in the first place.

    Third step - run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA. It then produces the descriptive values before conducting the MCA, identifying typical texts per dimension, and exporting the png-files uploaded to this repository.

  15. llama-python-codes-30k

    • huggingface.co
    Updated Oct 24, 2023
    Cite
    FLOCK4H (2023). llama-python-codes-30k [Dataset]. https://huggingface.co/datasets/flytech/llama-python-codes-30k
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 24, 2023
    Authors
    FLOCK4H
    License

    https://choosealicense.com/licenses/llama2/

    Description

    Python Codes - 30k examples, Llama1&2 tokenized dataset

      Author
    

    FlyTech. For a general guide on how to create, quantize, merge, or run inference with the model and more, visit: hackmd.io/my_first_ai

      Overview
    

    This dataset serves as a rich resource for various Natural Language Processing tasks such as Question Answering, Text Generation, and Text-to-Text Generation.

    It primarily focuses on instructional tasks in Python, tokenized specifically for the Llama architecture.… See the full description on the dataset page: https://huggingface.co/datasets/flytech/llama-python-codes-30k.

  16. Data from: Long-Term Tracing of Indoor Solar Harvesting

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, pdf, tar
    Updated Jul 22, 2024
    + more versions
    Cite
    Lukas Sigrist; Andres Gomez; Lothar Thiele (2024). Long-Term Tracing of Indoor Solar Harvesting [Dataset]. http://doi.org/10.5281/zenodo.3346976
    Explore at:
    pdf, tar, bin (available download formats)
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lukas Sigrist; Andres Gomez; Lothar Thiele
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Information

    This dataset presents long-term indoor solar harvesting traces jointly monitored with the ambient conditions. The data is recorded at 6 indoor positions with diverse characteristics at our institute at ETH Zurich in Zurich, Switzerland.

    The data is collected with a measurement platform [3] consisting of a solar panel (AM-5412) connected to a bq25505 energy harvesting chip that stores the harvested energy in a virtual battery circuit. Two TSL45315 light sensors placed on opposite sides of the solar panel monitor the illuminance level and a BME280 sensor logs ambient conditions like temperature, humidity and air pressure.

    The dataset contains the measurement of the energy flow at the input and the output of the bq25505 harvesting circuit, as well as the illuminance, temperature, humidity and air pressure measurements of the ambient sensors. The following timestamped data columns are available in the raw measurement format, as well as in preprocessed and filtered HDF5 datasets (a short loading sketch follows the column list):

    • V_in - Converter input/solar panel output voltage, in volt
    • I_in - Converter input/solar panel output current, in ampere
    • V_bat - Battery voltage (emulated through circuit), in volt
    • I_bat - Net Battery current, in/out flowing current, in ampere
    • Ev_left - Illuminance left of solar panel, in lux
    • Ev_right - Illuminance right of solar panel, in lux
    • P_amb - Ambient air pressure, in pascal
    • RH_amb - Ambient relative humidity, unit-less between 0 and 1
    • T_amb - Ambient temperature, in centigrade Celsius
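
    A minimal usage sketch, assuming the processed HDF5 files can be read with pandas and use the column names above (the file name and HDF5 key are assumptions):

    import pandas as pd

    # Hypothetical processed file for one measurement position
    df = pd.read_hdf("pos01_power.h5", key="data")

    # Harvested input power from the solar panel, in watt
    df["P_in"] = df["V_in"] * df["I_in"]
    print(df[["V_in", "I_in", "P_in"]].describe())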

    The following publication presents an overview of the dataset and more details on the deployment used for data collection. A copy of the abstract is included in this dataset; see the file abstract.pdf.

    L. Sigrist, A. Gomez, and L. Thiele. Dataset: Tracing Indoor Solar Harvesting. In Proceedings of the 2nd Workshop on Data Acquisition To Analysis (DATA '19), 2019. [under submission]

    Folder Structure and Files

    • processed/ - This folder holds the imported, merged and filtered datasets of the power and sensor measurements. The datasets are stored in HDF5 format and split by measurement position posXX and by power and ambient sensor measurements. The files belonging to this folder are contained in archives named yyyy_mm_processed.tar, where yyyy and mm represent the year and month the data was published. A separate file lists the exact content of each archive (see below).
    • raw/ - This folder holds the raw measurement files recorded with the RocketLogger [1, 2] and using the measurement platform available at [3]. The files belonging to this folder are contained in archives named yyyy_mm_raw.tar, where yyyy and mm represent the year and month the data was published. A separate file lists the exact content of each archive (see below).
    • LICENSE - License information for the dataset.
    • README.md - The README file containing this information.
    • abstract.pdf - A copy of the above mentioned abstract submitted to the DATA '19 Workshop, introducing this dataset and the deployment used to collect it.
    • raw_import.ipynb [open in nbviewer] - Jupyter Python notebook to import, merge, and filter the raw dataset from the raw/ folder. This is the exact code used to generate the processed dataset and store it in the HDF5 format in the processed/ folder.
    • raw_preview.ipynb [open in nbviewer] - This Jupyter Python notebook imports the raw dataset directly and plots a preview of the full power trace for all measurement positions.
    • processing_python.ipynb [open in nbviewer] - Jupyter Python notebook demonstrating the import and use of the processed dataset in Python. Calculates column-wise statistics, includes more detailed power plots and the simple energy predictor performance comparison included in the abstract.
    • processing_r.ipynb [open in nbviewer] - Jupyter R notebook demonstrating the import and use of the processed dataset in R. Calculates column-wise statistics and extracts and plots the energy harvesting conversion efficiency included in the abstract. Furthermore, the harvested power is analyzed as a function of the ambient light level.

    Dataset File Lists

    Processed Dataset Files

    The list of the processed datasets included in the yyyy_mm_processed.tar archive is provided in yyyy_mm_processed.files.md. The markdown formatted table lists the name of all files, their size in bytes, as well as the SHA-256 sums.

    Raw Dataset Files

    A list of the raw measurement files included in the yyyy_mm_raw.tar archive(s) is provided in yyyy_mm_raw.files.md. The markdown formatted table lists the name of all files, their size in bytes, as well as the SHA-256 sums.

    Dataset Revisions

    v1.0 (2019-08-03)

    Initial release.
    Includes the data collected from 2017-07-27 to 2019-08-01. The dataset archive files related to this revision are 2019_08_raw.tar and 2019_08_processed.tar.
    For position pos06, the measurements from 2018-01-06 00:00:00 to 2018-01-10 00:00:00 are filtered (data inconsistency in file indoor1_p27.rld).

    Dataset Authors, Copyright and License

    References

    [1] L. Sigrist, A. Gomez, R. Lim, S. Lippuner, M. Leubin, and L. Thiele. Measurement and validation of energy harvesting IoT devices. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

    [2] ETH Zurich, Computer Engineering Group. RocketLogger Project Website, https://rocketlogger.ethz.ch/.

    [3] L. Sigrist. Solar Harvesting and Ambient Tracing Platform, 2019. https://gitlab.ethz.ch/tec/public/employees/sigristl/harvesting_tracing

  17. Learn Data Science Series Part 1

    • kaggle.com
    Updated Dec 30, 2022
    Cite
    Rupesh Kumar (2022). Learn Data Science Series Part 1 [Dataset]. https://www.kaggle.com/datasets/hunter0007/learn-data-science-part-1
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 30, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rupesh Kumar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Please feel free to share it with others and consider supporting me if you find it helpful ⭐️.

    Overview:

    • Chapter 1: Getting started with pandas
    • Chapter 2: Analysis: Bringing it all together and making decisions
    • Chapter 3: Appending to DataFrame
    • Chapter 4: Boolean indexing of dataframes
    • Chapter 5: Categorical data
    • Chapter 6: Computational Tools
    • Chapter 7: Creating DataFrames
    • Chapter 8: Cross sections of different axes with MultiIndex
    • Chapter 9: Data Types
    • Chapter 10: Dealing with categorical variables
    • Chapter 11: Duplicated data
    • Chapter 12: Getting information about DataFrames
    • Chapter 13: Gotchas of pandas
    • Chapter 14: Graphs and Visualizations
    • Chapter 15: Grouping Data
    • Chapter 16: Grouping Time Series Data
    • Chapter 17: Holiday Calendars
    • Chapter 18: Indexing and selecting data
    • Chapter 19: IO for Google BigQuery
    • Chapter 20: JSON
    • Chapter 21: Making Pandas Play Nice With Native Python Datatypes
    • Chapter 22: Map Values
    • Chapter 23: Merge, join, and concatenate
    • Chapter 24: Meta: Documentation Guidelines
    • Chapter 25: Missing Data
    • Chapter 26: MultiIndex
    • Chapter 27: Pandas Datareader
    • Chapter 28: Pandas IO tools (reading and saving data sets)
    • Chapter 29: pd.DataFrame.apply
    • Chapter 30: Read MySQL to DataFrame
    • Chapter 31: Read SQL Server to Dataframe
    • Chapter 32: Reading files into pandas DataFrame
    • Chapter 33: Resampling
    • Chapter 34: Reshaping and pivoting
    • Chapter 35: Save pandas dataframe to a csv file
    • Chapter 36: Series
    • Chapter 37: Shifting and Lagging Data
    • Chapter 38: Simple manipulation of DataFrames
    • Chapter 39: String manipulation
    • Chapter 40: Using .ix, .iloc, .loc, .at and .iat to access a DataFrame
    • Chapter 41: Working with Time Series
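
    As a brief, hedged taste of the material covered (for example Chapter 23, Merge, join, and concatenate, and Chapter 15, Grouping Data), the following minimal pandas sketch uses invented example data:

    ```python
    import pandas as pd

    # Two small example DataFrames (invented for illustration only).
    students = pd.DataFrame({
        "student_id": [1, 2, 3],
        "name": ["Ada", "Grace", "Linus"],
    })
    scores = pd.DataFrame({
        "student_id": [1, 2, 4],
        "score": [91, 85, 78],
    })

    # Chapter 23 territory: merge on a shared key (inner join keeps matching rows only).
    merged = students.merge(scores, on="student_id", how="inner")

    # Chapter 15 territory: group and aggregate.
    mean_score = merged.groupby("name")["score"].mean()

    print(merged)
    print(mean_score)
    ```
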
  18. Connecticut CAMA and Parcel Layer

    • geodata.ct.gov
    • data.ct.gov
    • +1more
    Updated Nov 20, 2024
    + more versions
    Cite
    State of Connecticut (2024). Connecticut CAMA and Parcel Layer [Dataset]. https://geodata.ct.gov/datasets/ctmaps::connecticut-cama-and-parcel-layer
    Explore at:
    Dataset updated
    Nov 20, 2024
    Dataset authored and provided by
    State of Connecticut
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Description

    Coordinate System Update: Notably, this dataset is provided in the NAD 83 Connecticut State Plane (2011) projection (EPSG 2234) instead of WGS 1984 Web Mercator Auxiliary Sphere (EPSG 3857), which was the coordinate system of the 2023 dataset, and it will remain in Connecticut State Plane moving forward.

    Ownership Suppression and Data Access: The updated dataset now includes parcel data for all towns across the state, with some towns featuring fully suppressed ownership information. In these instances, the owner's name is replaced with the label "Current Owner," the co-owner's name is listed as "Current Co-Owner," and the mailing address appears as the property address itself. For towns with suppressed ownership data, users should be aware that there was no "Suppression" field in the submission to verify specific details. This measure was implemented this year to help verify compliance with suppression requirements.

    New Data Fields: The new dataset introduces the "Land Acres" field, which displays the total acreage for each parcel. This additional field allows for more detailed analysis and better supports planning, zoning, and property valuation tasks. Another important addition is the FIPS code field, which provides the Federal Information Processing Standards (FIPS) code for each parcel's corresponding block, allowing users to easily identify which block a parcel is in.

    Updated Service URL: The new parcel service URL includes all the updates mentioned above, such as the improved coordinate system, new data fields, and additional geospatial information. Users are strongly encouraged to transition to the new service as soon as possible to ensure that their workflows remain uninterrupted. Once you have transitioned to the new service, the URL will remain constant, ensuring long-term stability. For a limited time, the old service will continue to be available, but it will eventually be retired; users should plan to switch to the new service well before this cutoff to avoid any disruptions in data access.

    The dataset combines the Parcel and Computer-Assisted Mass Appraisal (CAMA) data for 2024 into a single dataset, designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor's database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 10/31/2024 from data collected in 2023-2024. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.

    CAMA Notes:
    • The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry.
    • The resulting dataset contains 1,353,595 entries with information on property assessments and other relevant attributes.
    • CAMA data was provided by the towns.

    Spatial Data Notes:
    • Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,290,196 parcels.
    • No alteration has been made to the spatial geometry of the data.
    • Fields associated with CAMA data were provided by the towns; data fields containing CAMA information were sourced from the towns' CAMA data.
    • If a town provided no field for linking parcels back to the CAMA, a field from the original data was selected if it joined back to the CAMA with a match rate above 50%.
    • Linking fields were renamed to "Link", and a census town code was added to the beginning of each linking value to create a unique identifier per town (see the sketch below).
    • Only the fields related to town name, location, editor, edit date, and the link fields associated with the towns' CAMA were used in the creation of this dataset; any other field provided in the original data was deleted or not used.
    • Field names for town (Muni, Municipality) were renamed to "Town Name".

    The attributes included in the data: Town Name, Owner, Co-Owner, Link, Editor, Edit Date, Collection Year (year the parcels were submitted), Location, Mailing Address, Mailing City, Mailing State, Assessed Total, Assessed Land, Assessed Building, Pre-Year Assessed Total, Appraised Land, Appraised Building, Appraised Outbuilding, Condition, Model, Valuation, Zone, State Use, State Use Description, Land Acres, Living Area, Effective Area, Total Rooms, Number of Bedrooms, Number of Baths, Number of Half-Baths, Sale Price, Sale Date, Qualified, Occupancy, Prior Sale Price, Prior Sale Date, Prior Book and Page, Planning Region, FIPS Code.

    Please note that not all parcels have a link to a CAMA entry. If any discrepancies are discovered within the data, whether geographical or attribute inaccuracies, please contact the respective municipalities directly to request any necessary amendments. Additional information about the specifics of data availability and compliance will be coming soon. If you need a WFS service for use in specific applications, please use the link provided on the source page.
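
    As a hedged illustration of the linking step described in the Spatial Data Notes (the file names, raw column names, and example town code are placeholders; only the "Link" field name and the prepended census town code come from the description above):

    ```python
    import pandas as pd

    TOWN_CODE = "08070"  # hypothetical census town code for one municipality

    # Hypothetical inputs: one town's parcel attribute table and its CAMA export.
    parcels = pd.read_csv("town_parcels.csv")  # assumed to contain a parcel identifier column
    cama = pd.read_csv("town_cama.csv")        # assumed to contain the matching identifier

    # Standardize the town-specific linking columns to the "Link" field and
    # prepend the census town code to make the identifier unique statewide.
    parcels["Link"] = TOWN_CODE + parcels["PARCEL_ID"].astype(str)  # column name assumed
    cama["Link"] = TOWN_CODE + cama["PID"].astype(str)              # column name assumed

    # Join CAMA attributes onto the parcel records; not every parcel will match.
    combined = parcels.merge(cama, on="Link", how="left")
    match_rate = combined["PID"].notna().mean()
    print(f"CAMA match rate for this town: {match_rate:.1%}")
    ```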

  19. This American Life Podcast Transcript Dataset

    • kaggle.com
    Updated Dec 18, 2023
    Cite
    The Devastator (2023). This American Life Podcast Transcript Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/this-american-life-podcast-transcript-dataset/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    This American Life Podcast Transcript Dataset

    This American Life Podcast Transcripts with Speaker Information and Timestamps

    By Chris Jewell [source]

    About this dataset

    This dataset provides a comprehensive collection of the transcripts for every episode of the popular podcast This American Life since its inception in November 1995. The dataset includes detailed speaker information, timestamps, and act or segment names for each line spoken throughout the episodes.

    With a focus on web scraping using Python and utilizing the powerful BeautifulSoup library, this dataset was meticulously created to offer researchers and enthusiasts an invaluable resource for various analytical purposes. Whether it be sentiment analysis, linguistic studies, or other forms of textual analysis, these transcripts provide a rich mine of data waiting to be explored.
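
    As a hedged sketch of the scraping approach described here (the transcript URL pattern and the choice of HTML elements are assumptions, not details confirmed by the dataset author), the collection step might look roughly like this:

    ```python
    import requests
    from bs4 import BeautifulSoup

    # Assumption: episode transcripts live at a URL of roughly this shape.
    EPISODE_URL = "https://www.thisamericanlife.org/1/transcript"  # hypothetical example

    response = requests.get(
        EPISODE_URL,
        headers={"User-Agent": "transcript-research-bot"},
        timeout=30,
    )
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Assumption: each spoken line is a <p> element; speaker names, act titles,
    # and timestamps would be attached via surrounding tags or attributes that
    # need to be identified by inspecting the actual page source.
    lines = [p.get_text(strip=True) for p in soup.find_all("p") if p.get_text(strip=True)]

    for line in lines[:10]:
        print(line)
    ```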

    The informative columns in this dataset include episode number, radio date (when each episode was aired), title (of each episode), act name (or segment title within an episode), line text (the spoken text by speakers), and speaker class (categorizing speakers into different roles such as host, guest, narrator). The timestamp column further enhances the precision by indicating when each line was spoken during an episode.

    In summary, this comprehensive collection showcases years' worth of captivating storytelling and insightful discussions from This American Life.

    How to use the dataset

    • Exploring Episode Information:

      • The episode_number column represents the number assigned to each episode of the podcast. You can use this column to identify and filter specific episodes based on their number.
      • The title column contains the title of each episode. You can utilize it to search for episodes related to specific topics or themes.
      • The radio_date column indicates when an episode was aired on the radio. It helps in understanding chronological order and exploring episodes released during specific time periods.
    • Analyzing Speaker Information:

      • The speaker_class column classifies speakers into different categories such as host, guest, or narrator. You can analyze speakers based on their roles or categories throughout various episodes.
      • By examining individual speakers' lines using the line_text column, you can explore patterns in speech or track conversations involving specific individuals.
    • Understanding Act/Segment Details:

      • Some episodes may have multiple acts or segments that cover different stories within a single episode. The act_name column provides insight into these act titles or segment names.
    • Utilizing Timestamps:

      • Each line spoken by a speaker is associated with a timestamp in the timestamp field. This enables mapping spoken lines to specific points within an episode.

    • Textual Analysis:

      • Perform sentiment analysis on the text-based sentiments expressed by different speakers across various episodes.
      • Conduct topic modeling techniques like Latent Dirichlet Allocation (LDA) to identify recurring themes or topics discussed in This American Life episodes.
      • Utilize natural language processing techniques to understand linguistic patterns, word frequencies, and sentiment changes over time or across different speakers.

    Please note:
    • Ensure you have basic knowledge of data manipulation, analysis, and visualization techniques.
    • Consider preprocessing the text data by cleaning punctuation, removing stopwords, and normalizing words for optimal analysis results.
    • Feel free to combine this dataset with external sources, such as additional transcripts, for more comprehensive analysis.
    A short loading example is sketched below.
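
    As a minimal, hedged loading sketch (the CSV file name is hypothetical; the column names follow the description above):

    ```python
    import pandas as pd

    # Hypothetical file name; substitute the actual CSV shipped with the dataset.
    df = pd.read_csv("this_american_life_transcripts.csv")

    # Basic orientation: episodes, speaker roles, and line counts.
    print(df[["episode_number", "title", "radio_date"]].drop_duplicates().head())
    print(df["speaker_class"].value_counts())

    # Lines spoken by hosts in a single episode, in order of appearance.
    episode_one_hosts = df[(df["episode_number"] == 1) & (df["speaker_class"] == "host")]
    print(episode_one_hosts[["timestamp", "line_text"]].head())
    ```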

    Research Ideas

    • Sentiment Analysis: With the transcript data and speaker information, this dataset can be used to perform sentiment analysis on each line spoken by different speakers in the podcast episodes. This can provide insights into the overall tone and sentiment of the podcast episodes.
    • Speaker Analysis: By analyzing the speaker information and their respective lines, this dataset can be used to analyze patterns in terms of who speaks more or less frequently, which speakers are more prominent or influential in certain episodes or acts, and how different speakers contribute to the narrative structure of each episode.
    • Topic Modeling: By using natural language processing techniques, this dataset can be used for topic modeling analysis to identify recurring themes or topics discussed in This American Life episodes. This can help uncover patterns or track how certain topics have evolved over time throughout the podcast's history.

    Acknowledgements

    If yo...

  20. (HS 2) Automate Workflows using Jupyter notebook to create Large Extent Spatial Datasets

    • search.dataone.org
    • hydroshare.org
    Updated Oct 19, 2024
    + more versions
    Cite
    Young-Don Choi (2024). (HS 2) Automate Workflows using Jupyter notebook to create Large Extent Spatial Datasets [Dataset]. http://doi.org/10.4211/hs.a52df87347ef47c388d9633925cde9ad
    Explore at:
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    Hydroshare
    Authors
    Young-Don Choi
    Description

    We implemented automated workflows using Jupyter notebooks for each state. The GIS processing, crucial for merging, extracting, and projecting GeoTIFF data, was performed using ArcPy, a Python package for geographic data analysis, conversion, and management within ArcGIS (Toms, 2015). After generating state-scale LES (large extent spatial) datasets in GeoTIFF format, we utilized the xarray and rioxarray Python packages to convert GeoTIFF to NetCDF. Xarray is a Python package for working with multi-dimensional arrays, and rioxarray is the rasterio extension for xarray; rasterio is a Python library for reading and writing GeoTIFF and other raster formats. Xarray facilitated data manipulation and metadata addition in the NetCDF file, while rioxarray was used to save the GeoTIFF data as NetCDF. These procedures resulted in the creation of three HydroShare resources (HS 3, HS 4 and HS 5) for sharing state-scale LES datasets. Notably, due to licensing constraints with ArcGIS Pro, a commercial GIS software, the Jupyter notebook development was undertaken on a Windows OS.
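
    A hedged sketch of the GeoTIFF-to-NetCDF conversion described above (file paths, attribute values, and tile layout are placeholders, not the resource's actual inputs):

    ```python
    import rioxarray
    from rioxarray.merge import merge_arrays

    # Hypothetical input tiles for one state; replace with the actual GeoTIFF paths.
    tile_paths = ["tile_north.tif", "tile_south.tif"]
    tiles = [rioxarray.open_rasterio(path, masked=True) for path in tile_paths]

    # Merge the tiles into a single state-scale raster.
    state_raster = merge_arrays(tiles)

    # Add descriptive metadata before export.
    state_raster = state_raster.assign_attrs(
        title="State-scale LES dataset (example)",
        source="Derived from GeoTIFF tiles via ArcPy preprocessing",
    )

    # Save the result as NetCDF; the CRS and transform are carried along by rioxarray.
    state_raster.to_netcdf("state_les_dataset.nc")
    ```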

