In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv. The student dataset contains columns such as Age, Gender, Grade, and Employed. The marks.csv dataset contains columns such as Mark and City. The Student_id column is common between the two datasets. Follow these steps to complete this exercise.
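A minimal pandas sketch of the merge, assuming student.csv and marks.csv sit in the working directory and share the Student_id column exactly as described above:

import pandas as pd

# load both datasets; the column lists follow the description above
student = pd.read_csv('student.csv')   # Student_id, Age, Gender, Grade, Employed
marks = pd.read_csv('marks.csv')       # Student_id, Mark, City

# join on the common Student_id column
merged = pd.merge(student, marks, on='Student_id')
print(merged.head())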
ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them (80,000+) nouns. In ImageNet, we aim to provide on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. Upon completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.
The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:
The resulting tar-ball may then be processed by TFDS.
To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split and export the results to a text file that is then uploaded to the ImageNet evaluation server. The maintainers of the evaluation server permit a single user to make up to 2 submissions per week in order to prevent overfitting.
To evaluate accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:
771 778 794 387 650
363 691 764 923 427
737 369 430 531 124
755 930 755 59 168
The export format is described in full in "readme.txt" within the 2013 development kit available at https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz; please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file consists of 100,000 lines, one per image in the test split. Each line of integers corresponds to the rank-ordered, top-5 predictions for that test image. The integers are 1-indexed, corresponding to the line number in the corresponding labels file. See labels.txt.
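As an illustration of that format, the hedged sketch below writes a submission file from a hypothetical array top5_preds of shape (100000, 5) holding rank-ordered, 0-indexed class predictions; the +1 shift converts them to the 1-indexed labels.txt numbering described above.

import numpy as np

top5_preds = np.zeros((100000, 5), dtype=int)  # placeholder for real model predictions

with open('classification_submission.txt', 'w') as f:
    for row in top5_preds:
        # shift 0-indexed class ids to the 1-indexed line numbers of labels.txt
        f.write(' '.join(str(label + 1) for label in row) + '\n')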
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imagenet2012_subset', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012_subset-1pct-5.0.0.png
First, we would like to thank the wildland fire advisory group. Their wisdom and guidance helped us build the dataset as it currently exists.

Currently, there are multiple, freely available fire datasets that identify wildfire and prescribed fire burned areas across the United States. However, these datasets are all limited in some way. Their time periods may cover only a couple of decades, or they may have stopped collecting data many years ago. Their spatial footprints may be limited to a specific geographic area or agency. Their attribute data may be limited to nothing more than a polygon and a year. None of the existing datasets provides a comprehensive picture of fires that have burned throughout the last few centuries.

Our dataset uses these existing layers and a series of both manual processes and ArcGIS Python (arcpy) scripts to merge them into a single dataset that encompasses the known wildfires and prescribed fires within the United States and certain territories. Forty different fire layers were utilized in this dataset. First, these datasets were ranked by order of observed quality (Tiers). The datasets were given a common set of attribute fields, and as many of these fields were populated as possible within each dataset. All fire layers were then merged together by their common attributes to create a merged dataset containing all fire polygons. Polygons were then processed in order of Tier (1-8) so that overlapping polygons in the same year and Tier were dissolved together. Overlapping polygons in subsequent Tiers were removed from the dataset. Attributes from the original datasets of all intersecting polygons in the same year across all Tiers were also merged so that all attributes from all Tiers were included, but only the polygons from the highest-ranking Tier were dissolved to form the fire polygon. The resulting product (the combined dataset) has only one fire per year in a given area with one set of attributes.

While it combines wildfire data from 40 wildfire layers and therefore has more complete information on wildfires than the datasets that went into it, this dataset also has its own set of limitations. Please see the Data Quality attributes within the metadata record for additional information on this dataset's limitations. Overall, we believe this dataset is designed to be a comprehensive collection of fire boundaries within the United States and provides a more thorough and complete picture of fires across the United States than the datasets that went into it.
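The arcpy scripts themselves are not included here, but a heavily simplified sketch of the merge-and-dissolve idea might look like the following; the workspace, layer names, and field names are illustrative assumptions rather than the ones used in the actual workflow.

import arcpy

arcpy.env.workspace = r"C:\fires\working.gdb"  # hypothetical workspace

# merge the source fire layers (40 in the real workflow) into one feature class
fire_layers = ["fires_tier1", "fires_tier2"]
arcpy.management.Merge(fire_layers, "fires_merged")

# dissolve overlapping polygons that share the same fire year and Tier
arcpy.management.Dissolve("fires_merged", "fires_combined",
                          dissolve_field=["FIRE_YEAR", "TIER"])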
Automatically describing images using natural sentences is essential for the inclusion of visually impaired people on the Internet. Although there are many datasets in the literature, most of them contain only English captions, and datasets with captions in other languages are scarce.
The #PraCegoVer movement arose on the Internet, encouraging social media users to publish images, tag them with #PraCegoVer, and add a short description of their content. Inspired by this movement, we propose #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.
Dataset Structure
The dataset comprises an images directory containing the images and the file dataset.json, which holds a list of JSON objects with the following attributes:
user: anonymized user that made the post;
filename: image file name;
raw_caption: raw caption;
caption: clean caption;
date: post date.
Each instance in dataset.json is associated with exactly one image in the images directory, whose file name is given by the attribute filename. We also provide a sample with five instances, so users can download the sample to get an overview of the dataset before downloading it completely.
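For orientation, a small sketch of inspecting the sample is shown below; it assumes dataset.json and the images directory have been extracted into the working directory and uses Pillow only as one possible way to open the image files.

import json
from PIL import Image

with open('dataset.json', encoding='utf-8') as f:
    posts = json.load(f)

first = posts[0]
print(first['user'], first['date'])
print(first['caption'])

# each instance points at exactly one file in the images directory
img = Image.open(f"images/{first['filename']}")
print(img.size)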
Download Instructions
If you just want an overview of the dataset structure, you can download sample.tar.gz. But if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to join and uncompress the files:
cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz
Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:
python download_dataset.py --access_token=
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bar-Lev, D., Orr, I., Sabary, O., Etzion, T., & Yaakobi, E. Scalable and robust DNA-based storage via coding theory and deep learning. 2024.
Datasets description
This document provides an overview of the 5 datasets introduced in this work. For each dataset we provide both the raw .fastq files with the sequenced reads and a file with the processed, binned reads obtained by the binning step described in the paper.
The dataset is provided under a similar license as the code repository, with scripts for loading and processing the data available at: https://github.com/itaiorr/Deep-DNA-based-storage.git
The datasets
The data were synthesized by Twist Bioscience, and the datasets are differentiated by the sequencing technology used. Two Illumina datasets were both generated by Illumina MiSeq. The reads in these two datasets were obtained with paired-end sequencing, and the merging (stitching) was done with the PEAR software. We include both raw reads and stitched reads in our repository under the names:
Pilot Illumina dataset
Test Illumina dataset
Three Nanopore datasets, all generated by Oxford Nanopore Technologies MinION under the names:
Pilot Nanopore dataset
Test Nanopore first flowcell dataset (termed in the paper as “Nanopore single flowcell”).
Test Nanopore second flowcell dataset
Additionally, for completeness, we also include a file with the processed and binned reads of the test Nanopore dataset combining the two flowcells (termed in the paper as “Nanopore two flowcells”). This can be found in the file BinnedNanoporeTwoFlowcells.txt.
Detailed description
The binned format was created using the binning step described in the paper. Each cluster of reads appears in the file with a header followed by the reads. More specifically:
The header consists of 2 lines; the first is the encoded sequence of the cluster, and the second is a line of 18 “*” characters that should be ignored
The reads of the cluster are provided after the header, with each read given on a separate line
Each cluster ends with two empty lines
Data processing
To ease the processing of our datasets, we also provide the following Python scripts (see https://github.com/itaiorr/Deep-DNA-based-storage)
Preprocessor.py - includes our preprocessing procedure of the raw reads. The procedure detects and truncates the primers
Parser.py - parses the file of binned reads and creates two Python dictionaries. In the first dictionary, each key is an encoded sequence and the value is a list of the reads in the cluster. In the second dictionary, the keys are the cluster indices and the values are lists of the reads in the cluster (see the sketch below).
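Parser.py in the repository is the authoritative implementation; the sketch below only illustrates how the binned format described above (encoded-sequence header, a line of 18 "*", one read per line, clusters separated by two empty lines) could be turned into the first of those dictionaries.

def parse_binned(path):
    clusters = {}
    with open(path) as f:
        text = f.read()
    # clusters are separated by two empty lines
    for block in text.split('\n\n\n'):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2:
            continue
        encoded_seq = lines[0]   # first header line: the encoded sequence
        reads = lines[2:]        # skip the line of '*' characters
        clusters[encoded_seq] = reads
    return clusters

clusters = parse_binned('BinnedNanoporeTwoFlowcells.txt')
print(len(clusters))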
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have created a database that includes both static and dynamic structures of four sodium-ion polyanionic cathode materials, NaMPO4 (olivine), NaMPO4 (maricite), Na2MSiO4 and Na2.56M1.72(SO4)3, along with various structures incorporating doping of transition metal ions (M). We consider four different transition metal ions (Fe, Mn, Co, Ni). Sampling was done using structure optimization, ab-initio molecular dynamics and machine-learning-driven dynamical sampling. The dataset consists of 113,703 structures. For each sampled structure, we record its crystal composition, total energy, atom-wise force vectors, atom-wise magnetic moments, and point charges obtained through Bader analysis. Our polyanionic sodium-ion battery database serves as a valuable addition to existing datasets, enabling the exploration of phase space while providing insights into the dynamic behavior of the materials.

For the sampling, density functional theory (DFT) calculations were performed using the Vienna Ab initio Simulation Package (VASP) version 6.4. The Perdew-Burke-Ernzerhof (PBE) functional with Hubbard-U corrections was used for all calculations. The U-values are similar to the ones used for the Materials Project (Fe: 5.3 eV, Mn: 3.9 eV, Co: 3.32 eV, Ni: 6.2 eV). For all calculations, an energy cutoff of 520 eV was applied, with a smearing width of 0.01 eV and convergence criteria set to 1e-5 eV for energy and 0.03 eV/Å for forces. All calculations were performed with spin polarization. The k-points employed for the four materials were fixed, with NaMPO4 (olivine) and NaMPO4 (maricite) utilizing [3,4,6] gamma points, Na2MSiO4 employing [3,4,4] gamma points and Na2.56M1.72(SO4)3 utilizing [2,3,4] gamma points. When constructing supercells, the gamma points in the direction of cell enlargement were halved.

The dataset is presented in XYZ format, along with a few Python scripts. It is divided into single-transition-metal-ion structures and multiple-transition-metal-ion structures. This division is provided for each of the four cathode materials: NaMPO4 (olivine), NaMPO4 (maricite), Na2MSiO4 and Na2.56M1.72(SO4)3. For example, the Na2.56M1.72(SO4)3 structures are split into the single-transition-metal-ion file Na2M2SO4_alluadite_single.xyz and the multiple-transition-metal-ion file Na2M2SO4_alluadite_multiple.xyz. The combined dataset, consisting of 113,703 structures, is available in Combined.xyz.

To extract structural compositions and physical properties, the ase.io.read function from ASE version 3.23.0 is used. An example of how to extract data and plot the physical properties is provided at https://github.com/dtu-energy/cathode-generation-workflow/tree/main/extract_data/read_data.py, and https://github.com/dtu-energy/cathode-generation-workflow/tree/main/extract_data/utils.py contains two functions, one used to attach Bader charges to an ASE Atoms object and another to combine multiple XYZ data files.

To cite the data, please use the DOI https://doi.org/10.11583/DTU.27202446
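A minimal sketch of reading the combined file with ASE is shown below; it assumes Combined.xyz is in the working directory and that energies and forces are stored in the extended-XYZ frames, so the repository scripts remain the reference for the exact keys.

from ase.io import read

# index=':' returns a list of Atoms objects, one per structure
structures = read('Combined.xyz', index=':')
print('number of structures:', len(structures))

first = structures[0]
print(first.get_chemical_formula())   # crystal composition
print(first.get_potential_energy())   # total energy, if stored with the frame
print(first.get_forces().shape)       # atom-wise force vectors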
By UCI [source]
This dataset provides an intimate look into student performance and engagement. It grants researchers access to numerous salient metrics of academic performance that illuminate a broad spectrum of student behaviors: how students interact with online learning material; quantitative indicators reflecting their academic outcomes; and demographic data such as age group, gender, and prior education level, among others.
The main objective of this dataset is to provide analysts and educators alike with empirical insights underpinning individualized learning experiences - specifically in identifying cases where students may be 'at risk'. Given that preventive early interventions have been shown to significantly mitigate the chances of course or program withdrawal among struggling students, accurate predictive measures such as this can greatly steer pedagogical strategies towards being more success-oriented.
One unique feature of this dataset is its intricate detailing. Not only does it provide overarching summaries on a per-student basis for each presented course, but it also furnishes data related to assessments (scores and submission dates) along with information on individuals' interactions within VLEs (virtual learning environments), spanning different types like forums, content pages, etc. Such comprehensive collation across multiple contextual layers helps paint an encompassing portrayal of the student experience that can guide better instructional design.
Due credit must be given when utilizing this database for research purposes through citation. Specifically, referencing (Kuzilek et al., 2015), OU Analyse: Analysing At-Risk Students at The Open University, published in Learning Analytics Review, is required, as the analysis methodologies are grounded in that seminal work.
It is important to note that protection of student privacy is paramount within this dataset's terms and conditions. Stringent anonymization techniques have been applied to sensitive variables: while detailed, profiles cannot be traced back to the original respondents.
How To Use This Dataset:
Understanding Your Objectives: Ideal objectives for using this dataset could be to identify at-risk students before they drop out of a class or program, improving course design by analyzing how assignments contribute to final grades, or simply examining relationships between different variables and student performance.
Set up your Analytical Environment: Before starting any analysis, make sure you have an analytical environment set up where you can load the CSV files included in this dataset. You can use Python notebooks (Jupyter), RStudio, or Tableau-based software if you also want visual representations.
Explore Data Individually: There are seven separate datasets available: Assessments; Courses; Student Assessment; Student Info; Vle (Virtual Learning Environment); Student Registration; and Student Vle. Load these CSVs separately into your environment and do an initial exploration of each one: find out what kind of data they contain (numerical/categorical), whether they have missing values, etc.
Merge Datasets: As the core idea is to track a student’s journey through multiple courses over time, combining these datasets will provide insights from wider perspectives. One way could be merging them using common key columns such as 'code_module', 'code_presentation', and 'id_student' (see the sketch after these steps). But make sure the merge depends on what question you're trying to answer.
Identify Key Metrics: Your key metrics will depend on your objectives but might include: overall grade averages per course or assessment type/student/region/gender/age group, number of clicks in the virtual learning environment, student registration status, etc.
Run Your Analysis: Now you can run queries to analyze the data relevant to your objectives. Try questions like: What factors most strongly predict whether a student will fail an assessment? Or how does course difficulty or the number of allotments per week change students' scores?
Visualization: Visualizing your data can be crucial for understanding patterns and relationships between variables. Use graphs like bar plots, heatmaps, and histograms to represent different aspects of your analyses.
Actionable Insights: The final step is interpreting these results in ways that are meaningf...
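As referenced in the merging step above, here is a hedged sketch of combining two of the tables; it assumes the standard OULAD file and column names (studentInfo.csv, studentVle.csv, sum_click, final_result), which should be checked against the downloaded CSVs.

import pandas as pd

student_info = pd.read_csv('studentInfo.csv')
student_vle = pd.read_csv('studentVle.csv')

# total VLE clicks per student per course presentation
clicks = (student_vle
          .groupby(['code_module', 'code_presentation', 'id_student'])['sum_click']
          .sum()
          .reset_index())

merged = student_info.merge(clicks,
                            on=['code_module', 'code_presentation', 'id_student'],
                            how='left')
print(merged[['id_student', 'final_result', 'sum_click']].head())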
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The identification of peptide sequences and their post-translational modifications (PTMs) is a crucial step in the analysis of bottom-up proteomics data. The recent development of open modification search (OMS) engines allows virtually all PTMs to be searched for. This not only increases the number of spectra that can be matched to peptides but also greatly advances the understanding of the biological roles of PTMs through the identification, and thereby facilitated quantification, of peptidoforms (peptide sequences and their potential PTMs). While the benefits of combining results from multiple protein database search engines have been established previously, similar approaches for OMS results have been missing so far. Here, we compare and combine results from three different OMS engines, demonstrating an increase in peptide spectrum matches of 8-18%. The unification of search results furthermore allows for the combined downstream processing of search results, including the mapping to potential PTMs. Finally, we test the ability of OMS engines to identify glycosylated peptides. The implementation of these engines in the Python framework Ursgal facilitates the straightforward application of OMS with unified parameters and results files, thereby enabling as yet unmatched high-throughput, large-scale data analysis.
This dataset includes all relevant results files, databases, and scripts that correspond to the accompanying journal article. Specifically, the following files are deposited:
Homo_sapiens_PXD004452_results.zip: result files from OMS and CS for the dataset PXD004452
Homo_sapiens_PXD013715_results.zip: result files from OMS and CS for the dataset PXD013715
Haloferax_volcanii_PXD021874_results.zip: result files from OMS and CS for the dataset PXD021874
Escherichia_coli_PXD000498_results.zip: result files from OMS and CS for the dataset PXD000498
databases.zip: target-decoy databases for Homo sapiens, Escherichia coli and Haloferax volcanii as well as a glycan database for Homo sapiens
scripts.zip: example scripts for all relevant steps of the analysis
mzml_files.zip: mzML files for all included datasets
ursgal.zip: current version of Ursgal (0.6.7) that has been used to generate the results (for most recent versions see https://github.com/ursgal/ursgal)
The United States is divided and sub-divided into successively smaller hydrologic units, which are classified into four levels: regions, sub-regions, accounting units, and cataloging units. The hydrologic units are arranged or nested within each other, from the largest geographic area (regions) to the smallest geographic area (cataloging units). Each hydrologic unit is identified by a unique hydrologic unit code (HUC) consisting of two to eight digits based on the four levels of classification in the hydrologic unit system. A shapefile (or geodatabase) of watersheds for the state of Alaska and parts of western Canada was created by merging two datasets: the U.S. Watershed Boundary Dataset (WBD) and the Government of Canada's National Hydro Network (NHN). Since many rivers in Alaska are transboundary, the NHN data is necessary to capture their watersheds. The WBD data can be found at https://catalog.data.gov/dataset/usgs-national-watershed-boundary-dataset-wbd-downloadable-data-collection-national-geospatial- and the NHN data can be found here: https://open.canada.ca/data/en/dataset/a4b190fe-e090-4e6d-881e-b87956c07977. The included Python script was used to subset and merge the two datasets into the single dataset archived here.
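The archived script is the authoritative version; the sketch below merely illustrates the subset-and-merge idea with GeoPandas, using hypothetical file names and leaving out the actual subsetting criteria.

import geopandas as gpd
import pandas as pd

wbd = gpd.read_file('wbd_alaska.shp')           # U.S. Watershed Boundary Dataset subset
nhn = gpd.read_file('nhn_western_canada.shp')   # National Hydro Network subset

# reproject to a common coordinate reference system before combining
nhn = nhn.to_crs(wbd.crs)

watersheds = gpd.GeoDataFrame(pd.concat([wbd, nhn], ignore_index=True), crs=wbd.crs)
watersheds.to_file('alaska_canada_watersheds.shp')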
We implemented automated workflows using Jupyter notebooks for each state. The GIS processing, crucial for merging, extracting, and projecting GeoTIFF data, was performed using ArcPy, a Python package for geographic data analysis, conversion, and management within ArcGIS (Toms, 2015). After generating state-scale LES (large extent spatial) datasets in GeoTIFF format, we utilized the xarray and rioxarray Python packages to convert GeoTIFF to NetCDF. Xarray is a Python package for working with multi-dimensional arrays, and rioxarray is the rasterio xarray extension; rasterio is a Python library for reading and writing GeoTIFF and other raster formats. Xarray facilitated data manipulation and metadata addition in the NetCDF file, while rioxarray was used to save the GeoTIFF as NetCDF. These procedures resulted in the creation of three HydroShare resources (HS 3, HS 4 and HS 5) for sharing state-scale LES datasets. Notably, due to licensing constraints with ArcGIS Pro, a commercial GIS software, the Jupyter notebook development was undertaken on a Windows OS.
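A minimal sketch of the GeoTIFF-to-NetCDF step is shown below; the file names and metadata are placeholders rather than the ones used in the actual notebooks.

import rioxarray

les = rioxarray.open_rasterio('state_les.tif')   # returns an xarray DataArray
les = les.rename('les')
les.attrs['description'] = 'State-scale large extent spatial (LES) dataset'

les.to_netcdf('state_les.nc')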
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Homage to Python
The VanRossum dataset is all Python! I used DataMix to combine a handful of highly rated Python-centric datasets, to get a sampling of each and create something new. This data set has 80,000 entries and is named after Guido Van Rossum, the man who invented Python back in 1991. See the VanRossum Collection on HF for all things related to this dataset.
Alpaca / GPT
There are 2 versions of this dataset available on Huggingface.
VanRossum-GPT… See the full description on the dataset page: https://huggingface.co/datasets/theprint/VanRossum-Alpaca.
This dataset combines the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2023 into a single dataset. It is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually.
These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 12/08/2023 from data collected in 2022-2023. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.
CAMA Notes:
The CAMA underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.
CAMA was provided by the towns.
Canaan parcels are viewable, but no additional information is available since no CAMA data was submitted.
Spatial Data Notes:
Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,247,506 parcels.
No alteration has been made to the spatial geometry of the data.
Fields that are associated with CAMA data were provided by towns.
The data fields that have information from the CAMA were sourced from the towns’ CAMA data.
If the town did not provide a field for linking the parcels back to the CAMA, a new field within the original data was selected, provided it had a match rate above 50% when joined back to the CAMA.
Linking fields were renamed to "Link".
All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.
Only the fields related to town name, location, editor, edit date, and the link fields associated with the towns’ CAMA were used in the creation of this dataset. Any other field provided in the original data was deleted or not used.
Field names for town (Muni, Municipality) were renamed to "Town Name".
The attributes included in the data:
Town Name
Owner
Co-Owner
Link
Editor
Edit Date
Collection year – year the parcels were submitted
Location
Mailing Address
Mailing City
Mailing State
Assessed Total
Assessed Land
Assessed Building
Pre-Year Assessed Total
Appraised Land
Appraised Building
Appraised Outbuilding
Condition
Model
Valuation
Zone
State Use
State Use Description
Living Area
Effective Area
Total rooms
Number of bedrooms
Number of Baths
Number of Half-Baths
Sale Price
Sale Date
Qualified
Occupancy
Prior Sale Price
Prior Sale Date
Prior Book and Page
Planning Region
*Please note that not all parcels have a link to a CAMA entry.
*If any discrepancies are discovered within the data, whether pertaining to geographical inaccuracies or attribute inaccuracy, please directly contact the respective municipalities to request any necessary amendments
As of 2/15/2023 - Occupancy, State Use, State Use Description, and Mailing State added to dataset
Additional information about the specifics of data availability and compliance will be coming soon.
This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains the data and scripts to generate the hydrological response variables for surface water in the Clarence Moreton subregion as reported in CLM261 (Gilfedder et al. 2016).
File CLM_AWRA_HRVs_flowchart.png shows the different files in this dataset and how they interact. The Python and R scripts were written by the BA modelling team to read, combine and analyse the source datasets CLM AWRA model, CLM groundwater model V1 and CLM16swg Surface water gauging station data within the Clarence Moreton Basin, as detailed below, to create the hydrological response variables for surface water as reported in CLM2.6.1 (Gilfedder et al. 2016).
R-script HRV_SWGW_CLM.R reads, for each model simulation, the outputs from the surface water model in netcdf format from file Qtot.nc (dataset CLM AWRA model) and the outputs from the groundwater model, flux_change.csv (dataset CLM groundwater model V1) and creates a set of files in subfolder /Output for each GaugeNr and simulation Year:
CLM_GaugeNr_Year_all.csv and CLM_GaugeNR_Year_baseline.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for baseline conditions
CLM_GaugeNr_Year_CRDP.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for CRDP conditions (=AWRA streamflow - MODFLOW change in SW-GW flux)
CLM_GaugeNr_Year_minMax.csv: minimum and maximum of HRVs over all 5000 simulations
Python script CLM_collate_DoE_Predictions.py collates that information into following files, for each HRV and each maxtype (absolute maximum (amax), relative maximum (pmax) and time of absolute maximum change (tmax)):
CLM_AWRA_HRV_maxtyp_DoE_Predictions: for each simulation and each gauge_nr, the maxtyp of the HRV over the prediction period (2012 to 2102)
CLM_AWRA_HRV_DoE_Observations: for each simulation and each gauge_nr, the HRV for the years that observations are available
CLM_AWRA_HRV_Observations: summary statistics of each HRV and the observed value (based on data set CLM16swg Surface water gauging station data within the Clarence Moreton Basin)
CLM_AWRA_HRV_maxtyp_Predictions: summary statistics of each HRV
R-script CLM_CreateObjectiveFunction.R calculates for each HRV the objective function value for all simulations and stores it in CLM_AWRA_HRV_ss.csv. This file is used by python script CLM_AWRA_SI.py to generate figure CLM-2615-002-SI.png (sensitivity indices).
The AWRA objective function is combined with the overall objective function from the groundwater model in dataset CLM Modflow Uncertainty Analysis (CLM_MF_DoE_ObjFun.csv) into csv file CLM_AWRA_HRV_oo.csv. This file is used to select behavioural simulations in python script CLM-2615-001-top10.py. This script uses files CLM_NodeOrder.csv and BA_Visualisation.py to create the figures CLM-2616-001-HRV_10pct.png.
Bioregional Assessment Programme (2016) CLM AWRA HRVs Uncertainty Analysis. Bioregional Assessment Derived Dataset. Viewed 28 September 2017, http://data.bioregionalassessments.gov.au/dataset/e51a513d-fde7-44ba-830c-07563a7b2402.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From Qld 100K mapsheets - Mount Lindsay
Derived From Qld 100K mapsheets - Helidon
Derived From Qld 100K mapsheets - Ipswich
Derived From CLM - Woogaroo Subgroup extent
Derived From CLM - Interpolated surfaces of Alluvium depth
Derived From CLM - Extent of Logan and Albert river alluvial systems
Derived From CLM - Bore allocations NSW v02
Derived From CLM - Bore allocations NSW
Derived From CLM - Bore assignments NSW and QLD summary tables
Derived From CLM - Geology NSW & Qld combined v02
Derived From CLM - Orara-Bungawalbin bedrock
Derived From CLM16gwl NSW Office of Water_GW licence extract linked to spatial locations_CLM_v3_13032014
Derived From CLM groundwater model hydraulic property data
Derived From CLM - Koukandowie FM bedrock
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From CLM - Gatton Sandstone extent
Derived From CLM16gwl NSW Office of Water, GW licence extract linked to spatial locations in CLM v2 28022014
Derived From Bioregional Assessment areas v03
Derived From NSW Geological Survey - geological units DRAFT line work.
Derived From Mean Annual Climate Data of Australia 1981 to 2012
Derived From CLM Preliminary Assessment Extent Definition & Report( CLM PAE)
Derived From Qld 100K mapsheets - Caboolture
Derived From CLM - AWRA Calibration Gauges SubCatchments
Derived From CLM - NSW Office of Water Gauge Data for Tweed, Richmond & Clarence rivers. Extract 20140901
Derived From Qld 100k mapsheets - Murwillumbah
Derived From AHGFContractedCatchment - V2.1 - Bremer-Warrill
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From QLD Current Exploration Permits for Minerals (EPM) in Queensland 6/3/2013
Derived From Pilot points for prediction interpolation of layer 1 in CLM groundwater model
Derived From CLM - Bore water level NSW
Derived From Climate model 0.05x0.05 cells and cell centroids
Derived From CLM - New South Wales Department of Trade and Investment 3D geological model layers
Derived From CLM - Metgasco 3D geological model formation top grids
Derived From State Transmissivity Estimates for Hydrogeology Cross-Cutting Project
Derived From CLM - Extent of Bremer river and Warrill creek alluvial systems
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
Derived From Qld 100K mapsheets - Esk
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores and NGIS v4 28072014
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From CLM - Qld Surface Geology Mapsheets
Derived From NSW Office of Water Pump Test dataset
Derived From [CLM -
This set of Python scripts and Jupyter notebooks constitutes a workflow for seamlessly merging multiple digital elevation models (DEMs) to produce a hydrologically robust high-resolution DEM for large river basins. The DEM merging method is adapted from Gallant, J.C. (2019) Merging lidar with coarser DEMs for hydrodynamic modelling over large areas, in: El Sawah, S. (Ed.) MODSIM2019, 23rd International Congress on Modelling and Simulation, Modelling and Simulation Society of Australia and New Zealand, https://mssanz.org.au/modsim2019/K24/gallant.pdf.

The workflow runs on the CSIRO EASI platform (https://research.csiro.au/easi/) and expects data stored in an AWS S3 bucket. Dask is used for parallel processing. The workflow was built to merge all the available high-resolution DEMs for the Murray Darling Basin, Australia, using 852 individual lidar and photogrammetry DEMs from the Geoscience Australia elevation data portal Elvis (https://elevation.fsdf.org.au/) and the Forests and Buildings removed DEM (FABDEM; Hawker et al. 2022, https://doi.org/10.1088/1748-9326/ac4d4f), a bare-earth, radar-derived, 1 arc-second resolution global elevation model. The seamless composite high-resolution Murray Darling Basin DEM datasets (5 m and 25 m resolutions) produced with this workflow can be downloaded here: https://doi.org/10.25919/e1z5-mx88.

The workflow is divided into three parts: 1) Preprocessing, 2) DEM merging and 3) Postprocessing and validation. The Jupyter notebooks in the workflow are also provided in HTML format for initial access to the content without needing a Python kernel.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The final dataset utilised for the publication "Investigating Reinforcement Learning Approaches In Stock Market Trading" was processed by downloading and combining data from multiple reputable sources to suit the specific needs of this project. Raw data were retrieved using a Python finance API. Afterwards, Python and NumPy were used to combine and normalise the data to create the final dataset.

The raw data was sourced as follows:
Stock prices of NVIDIA & AMD, financial indexes, and commodity prices: retrieved from Yahoo Finance.
Economic indicators: collected from the US Federal Reserve.

The dataset was normalised to minute intervals, and the stock prices were adjusted to account for stock splits. This dataset was used for exploring the application of reinforcement learning in stock market trading. After creating the dataset, it was used in a reinforcement learning environment to train several reinforcement learning algorithms, including deep Q-learning, policy networks, policy networks with baselines, actor-critic methods, and time series incorporation. The performance of these algorithms was then compared based on profit made and other financial evaluation metrics, to investigate the application of reinforcement learning algorithms in stock market trading.

The attached 'README.txt' contains methodological information and a glossary of all the variables in the .csv file.
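The description does not name the finance API, so the sketch below assumes the commonly used yfinance package, with illustrative tickers, period and normalisation; it is not the project's actual retrieval code.

import yfinance as yf
import numpy as np

# minute-interval prices (yfinance only serves 1m data for recent periods)
nvda = yf.download('NVDA', period='5d', interval='1m')
amd = yf.download('AMD', period='5d', interval='1m')

# simple min-max normalisation of the closing prices, as one possible approach
close = nvda['Close'].to_numpy()
normalised = (close - close.min()) / (close.max() - close.min())
print(normalised[:5])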
CIFAR-10 is an excellent dataset for many image processing experiments.
Usage instructions
from os import listdir, makedirs
from os.path import join, exists, expanduser
cache_dir = expanduser(join('~', '.keras'))
if not exists(cache_dir):
    makedirs(cache_dir)
datasets_dir = join(cache_dir, 'datasets') # /cifar-10-batches-py
if not exists(datasets_dir):
    makedirs(datasets_dir)
# If you have multiple input datasets, change the below cp command accordingly, typically:
# !cp ../input/cifar10-python/cifar-10-python.tar.gz ~/.keras/datasets/
!cp ../input/cifar-10-python.tar.gz ~/.keras/datasets/
!ln -s ~/.keras/datasets/cifar-10-python.tar.gz ~/.keras/datasets/cifar-10-batches-py.tar.gz
!tar xzvf ~/.keras/datasets/cifar-10-python.tar.gz -C ~/.keras/datasets/
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
!tar xzvf ../input/cifar-10-python.tar.gz
Then see the section "Dataset layout" at https://www.cs.toronto.edu/~kriz/cifar.html for details.
Downloaded directly from here:
https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
See description: https://www.cs.toronto.edu/~kriz/cifar.html
The dataset contains the logs used to produce the results described in the publication Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave, Raúl Domínguez et al., 2025.
Cooperative Surface Exploration
CoRob_MP1_results.xlsx: Includes the log produced at the commanding station during the Mission Phase 1. It has been used to produce the results evaluation of the MP1.
cmap.ply: Resulting map of the MP1.
ground_truth_transformed_and_downsampled.ply: Ground truth map used for the evaluation of the cooperative map accuracy.
Ground Truth Rover Logs
The dataset contains the samples used to generate the map provided as ground truth for the cave in the publication Cooperative Robotic Exploration of a Planetary Skylight Surface and Lava Cave, Raúl Domínguez et al., 2025.
The dataset has three parts. Between each of the parts, the data capture had to be interrupted, and after each interruption the position of the rover is not exactly the same as before. For that reason, it has been quite challenging to generate a full reconstruction using the three parts one after the other. In fact, since it was not possible to combine the different parts in a single SLAM reconstruction, the last log was not filtered or even pre-processed.
Each log contains:
- depthmaps: the raw LiDAR data from the Velodyne 32. Format: tiff.
- filtered_cloud: the pre-processed LiDAR data from the Velodyne 32. Format: ply.
- joint_states: the motor position values. Unfortunately the back axis passive joint is not included. Format: json.
- orientation_samples: the orientation as provided by the IMU sensor. Format: json.
Folders contents
├── 20211117-1112
│   ├── depth
│   │   └── depth_1637143958347198
│   ├── filtered_cloud
│   │   └── cloud_1637143958347198
│   ├── joint_states
│   │   └── joints_state_1637143957824829
│   └── orientation_samples
│       └── orientation_sample_1637143958005814
├── 20211117-1140
│   ├── depth
│   │   └── depth_1637145649108790
│   ├── filtered_cloud
│   │   └── cloud_1637145649108790
│   ├── joint_states
│   │   └── joints_state_1637145648630977
│   └── orientation_samples
│       └── orientation_sample_1637145648831795
└── 20211117-1205
    ├── depth
    │   └── depth_1637147164030135
    ├── filtered_cloud
    │   └── cloud_1637147164330388
    ├── joint_states
    │   └── joints_state_1637147163501574
    └── orientation_samples
        └── orientation_sample_1637147163655187
Cave reconstruction
Coyote 3 Logs
The msgpack datasets can be imported using Python with the pocolog2msgpack library
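A hedged sketch of loading one of the exported msgpack files is shown below; the file name is a placeholder, and the pocolog2msgpack documentation describes the exact layout of the exported streams.

import msgpack

with open('coyote3_odometry.msg', 'rb') as f:
    logdata = msgpack.unpack(f, raw=False)

# typically a dictionary keyed by log stream names; inspect the keys first
if isinstance(logdata, dict):
    print(list(logdata.keys())[:5])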
The geometrical rover model of Coyote 3 is included in URDF format. It can be used in environment reconstruction algorithms which require the positions of the different sensors.
MP3
Includes exports of the log files used to compute the KPIs of the MP3.
MP4
These logs were used to obtain the KPI values for the MP4. It is composed of the following archives:
- log_coyote_02-03-2023_13-22_01-exp3.zip
- log_coyote_02-03-2023_13-22_01-exp4.zip
- log_coyote_02-09-2023_19-14_18_demo_skylight.zip
- log_coyote_02-09-2023_19-14_20_demo_teleop.zip
- coyote3_odometry_20230209-154158.0003_msgpacks.tar.gz
- coyote3_odometry_20230203-125251.0819_msgpacks.tar.gz
Cave PLYs
Two integrated pointclouds and one trajectory produced from logs captured by Coyote 3 inside the cave:
- Skylight_subsampled_mesh.ply
- teleop_tunnel_pointcloud.ply
- traj.ply
Example scripts to load the datasets
The repository https://github.com/Rauldg/corobx_dataset_scripts contains some example scripts which load some of the datasets.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. This dataset combines the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2025 into a single dataset. It is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created in September 2025 from data collected in 2024-2025. Data was processed using Python scripts and ArcGIS Pro for standardization and integration of the data. To learn more about Parcel and CAMA data in CT, visit our Parcels Page in the Geodata Portal.

Coordinate system: This dataset is provided in the NAD 83 Connecticut State Plane (2011) (EPSG 2234) projection, as it was for 2024. Prior versions were provided in WGS 1984 Web Mercator Auxiliary Sphere (EPSG 3857).

Ownership Suppression: The updated dataset includes parcel data for all towns across the state, with some towns featuring fully suppressed ownership information. In these instances, the owner’s name was replaced with the label "Current Owner," the co-owner’s name will be listed as "Current Co-Owner," and the mailing address will appear as the property address itself. For towns with fully suppressed ownership data, please note that no "Suppression" field was included in the submission to confirm these details, and this labeling approach was implemented as the solution.

New Data Fields: The new dataset introduces the “Property Zip” and “Mailing Zip” fields, which display the zip codes for the owner and property.

Service URL: In 2024, we implemented a stable URL to maintain public access to the most up-to-date data layer. Users are strongly encouraged to transition to the new service as soon as possible to ensure uninterrupted workflows. This URL will remain persistent, providing long-term stability for your applications and integrations. Once you’ve transitioned to the new service, no further URL changes will be necessary.

CAMA Notes:
The CAMA underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,354,720 entries and information on property assessments and other relevant attributes.
CAMA was provided by the towns.

Spatial Data Notes:
Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,282,833 parcels.
No alteration has been made to the spatial geometry of the data.
Fields that are associated with CAMA data were provided by towns.
The data fields that have information from the CAMA were sourced from the towns’ CAMA data.
If no field for the parcels was provided for linking back to the CAMA by the town, a new field within the original data was selected if it had a match rate above 50% when joined back to the CAMA.
Linking fields were renamed to "Link".
All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.
Only the fields related to town name, location, editor, edit date, and link fields associated with the towns’ CAMA were included in the creation of this dataset. Any other field provided in the original data was deleted or not used.
Field names for town (Muni, Municipality) were renamed to "Town Name".

Attributes included in the data:
Town Name
Owner
Co-Owner
Link
Editor
Edit Date
Collection year – year the parcels were submitted
Location
Property Zip
Mailing Address
Mailing City
Mailing State
Mailing Zip
Assessed Total
Assessed Land
Assessed Building
Pre-Year Assessed Total
Appraised Land
Appraised Building
Appraised Outbuilding
Condition
Model
Valuation
Zone
State Use
State Use Description
Land Acre
Living Area
Effective Area
Total rooms
Number of bedrooms
Number of Baths
Number of Half-Baths
Sale Price
Sale Date
Qualified
Occupancy
Prior Sale Price
Prior Sale Date
Prior Book and Page
Planning Region
FIPS Code

*Please note that not all parcels have a link to a CAMA entry.
*If any discrepancies are discovered within the data, whether pertaining to geographical inaccuracies or attribute inaccuracy, please directly contact the respective municipalities to request any necessary amendments.

Additional information about the specifics of data availability and compliance will be coming soon. If you need a WFS service for use in specific applications, please click here. Contact: opm.giso@ct.gov
This repository attempts to assemble the largest Covid-19 epidemiological database in addition to a powerful set of expansive covariates. It includes open, publicly sourced, licensed data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, weather, and more.
This particular dataset corresponds to a join of all the different tables that are part of the repository. Therefore, expect the resulting samples to be highly sparse.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('covid19', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". This study aimed at utilizing the CEToM corpus (https://cetom.univie.ac.at/) (Tocharian) to analyze the life-world of the elites of an extinct society situated in modern western China. To acquire the raw data needed for steps 1 & 2, please contact Melanie Malzahn (melanie.malzahn@univie.ac.at). We conducted a mixed-methods study consisting of close reading, content analysis, and multiple correspondence analysis (MCA). The Excel file titled "fragments_architecture_combined.xlsx" allows for replication of the MCA and corresponds to the third step of the workflow outlined below.

We used the following programming languages and packages to prepare the dataset and to analyze the data. Data preparation and merging procedures were carried out in Python (version 3.9.10) with the packages pandas (version 1.5.3), os (version 3.12.0), re (version 3.12.0), numpy (version 1.24.3), gensim (version 4.3.1), BeautifulSoup4 (version 4.12.2), pyasn1 (version 0.4.8), and langdetect (version 1.0.9). Multiple Correspondence Analyses were conducted in R (version 4.3.2) with the packages FactoMineR (version 2.9), factoextra (version 1.0.7), readxl (version 1.4.3), tidyverse (version 2.0.0), ggplot2 (version 3.4.4) and psych (version 2.3.9).

After requesting the necessary files, please open the scripts in the order outlined below and execute the code files to replicate the analysis:

Preparatory step: Create a folder for the Python and R scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script will generate the following folders:
"tarim-brahmi_database" = folder which contains the Tocharian dictionaries and Tocharian text fragments.
"dictionaries" = contains Tocharian A and Tocharian B vocabularies, including linguistic features such as translations, meanings, part-of-speech tags etc. A full overview of the words is provided at https://cetom.univie.ac.at/?words.
"fragments" = contains Tocharian text fragments as xml-files.
"word_corpus_data" = folder that will contain Excel files of the corpus data after the first step.
"Architectural_terms" = contains the data on the architectural terms used in the dataset (e.g. dwelling, house).
"regional_data" = contains the data on the findspots (Tocharian and modern Chinese equivalent, e.g. Duldur-Akhur & Kucha).
"mca_ready_data" = the folder in which the Excel file with the merged data will be saved. Note that the prepared file named "fragments_architecture_combined.xlsx" can be saved into this directory. This allows you to skip steps 1 & 2 and reproduce the MCA of the content analysis based on the third step of our workflow (R script titled 3_conduct_MCA.R).

First step - run 1_read_xml-files.py: loops over the xml-files in the folder dictionaries and identifies word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. Then, it loops over the xml text files and extracts a text id number, language (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, material on which the text is written, medium, findspot, the source text in Tocharian, and the translation where available. After successful feature extraction, the resulting pandas dataframe object is exported to the word_corpus_data folder.
Second step - run 2_merge_excel_files.py: merges all Excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was based upon close reading in the first place.
Third step - run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA. It then produces the descriptive values before conducting the MCA, identifying typical texts per dimension, and exporting the png files uploaded to this repository.