49 datasets found
  1. Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

    • researchdata.tuwien.at
    html, pdf, zip
    Updated Feb 23, 2025
    Cite
    Negar Alinaghi; Ioannis Giannopoulos (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
    Explore at:
    Available download formats: html, zip, pdf
    Dataset updated
    Feb 23, 2025
    Dataset provided by
    TU Wien
    Authors
    Negar Alinaghi; Ioannis Giannopoulos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Folder Structure

    The folder named “submission” contains the following:

    1. “pythonProject”: This folder contains all the Python files and subfolders needed for analysis.
    2. ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

    Setting Up the Environment

    1. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.
    2. The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

    Subfolders

    1. Data_4_IJGIS

    • This folder contains the data used for the results reported in the paper.
    • Note: The data analysis described in the paper begins after the synchronization and cleaning of the recorded raw data; the published data is already synchronized and cleaned. This directory contains both the cleaned files and the merged files with extracted features. If you want to perform the segmentation and feature extraction yourself, run the respective Python files; otherwise, use the “merged_…csv” files as input for the training.

    2. results_[DateTime] (e.g., results_20240906_15_00_13)

    • This folder will be generated when you run the code and will store the output of each step.
    • The current folder contains results created during code debugging for the submission.
    • When you run the code, a new folder with fresh results will be generated.

    Python Files

    1. helper_functions.py

    • Contains reusable functions used throughout the analysis.
    • Each function includes a description of its purpose and the input parameters required.

    2. create_sanity_plots.py

    • Generates scatter plots like those in Figure 3 of the paper.
    • Although the code has been run for all 309 trials, it can be used to check the sample data provided.
    • Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
    • Usage: Run this file to create visualizations similar to Figure 3.

    3. overlapping_sliding_window_loop.py

    • Implements overlapping sliding window segmentation and generates plots like those in Figure 4.
    • Output:
      • Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.
      • Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.
      • A visualization of the segments, similar to Figure 4, will be automatically generated.
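
    As an illustration of what overlapping_sliding_window_loop.py does conceptually (this is a minimal sketch, not the authors' implementation; the column and file names are hypothetical), an overlapping sliding window over a time-indexed recording can be produced like this:

    import pandas as pd

    def sliding_windows(df, time_col="timestamp_s", win_s=2.0, step_s=1.0):
        # Yield successive windows of win_s seconds, advancing by step_s seconds
        # (the script's defaults are 2-10 s windows with a 1 s step).
        start, t_end = df[time_col].min(), df[time_col].max()
        while start + win_s <= t_end:
            yield df[(df[time_col] >= start) & (df[time_col] < start + win_s)]
            start += step_s

    # Hypothetical usage on a merged recording:
    # merged = pd.read_csv("merged_trial_01.csv")
    # for i, seg in enumerate(sliding_windows(merged, win_s=2.0, step_s=1.0)):
    #     seg.to_csv(f"segment_{i:04d}.csv", index=False)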

    4. gaze_features.py & imu_features.py

    • These files compute features as explained in Tables 1 and 2 of the paper, respectively.
    • They process the segmented recordings generated by the overlapping_sliding_window_loop.py.
    • Usage: To see how the features are calculated, run these files after the sliding-window segmentation to compute the features from the segmented data (a generic sketch follows below).
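
    The actual features follow Tables 1 and 2 of the paper; purely to illustrate the per-segment workflow (the column names below are hypothetical, not the paper's feature set), feature extraction over the segmented files could look like:

    import glob
    import pandas as pd

    def basic_segment_features(seg: pd.DataFrame) -> dict:
        # Placeholder summary statistics; the real definitions live in
        # gaze_features.py and imu_features.py.
        return {
            "gaze_x_mean": seg["gaze_x"].mean(),
            "gaze_x_std": seg["gaze_x"].std(),
            "acc_mag_mean": seg["acc_magnitude"].mean(),
        }

    # features = pd.DataFrame(
    #     basic_segment_features(pd.read_csv(f)) for f in sorted(glob.glob("Gaze/segment_*.csv"))
    # )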

    5. training_prediction.py

    • This file contains the main machine learning analysis of the paper: all the code for training the model, evaluating it, and using it for inference on the “monitoring part”. It covers the following steps:
    a. Data Preparation (corresponding to Section 5.1.1 of the paper)
    • Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.
    • A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the call is commented out; uncomment it if you want to see visually what has been changed compared to the original data.
    b. Training/Validation/Test Split
    • Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).
    • Make sure to follow the instructions in the code comments exactly.
    • Output: The split data is saved as .csv files in the results folder.
    c. Machine and Deep Learning Experiments

    This part contains three main code blocks:

    • MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, uncomment it and comment out the following blocks accordingly.
    • XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.
    • XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

    Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
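
    As a hedged outline of the experiments block above (not the authors' tuned pipeline; file names, feature columns, hyperparameters, and the threshold value are assumptions), the untuned XGBoost route plus an empirical confidence threshold could look roughly like:

    import pandas as pd
    from xgboost import XGBClassifier
    from sklearn.metrics import f1_score

    # Hypothetical split files; the real ones are written to the results folder in step b.
    train = pd.read_csv("train_split.csv")
    test = pd.read_csv("test_split.csv")
    X_train, y_train = train.drop(columns=["label"]), train["label"]
    X_test, y_test = test.drop(columns=["label"]), test["label"]

    model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
    model.fit(X_train, y_train)
    print("macro F1:", f1_score(y_test, model.predict(X_test), average="macro"))

    # Confidence thresholding (idea only): keep predictions whose maximum class
    # probability exceeds a threshold; 0.8 is an assumed example value.
    proba = model.predict_proba(X_test)
    confident_mask = proba.max(axis=1) >= 0.8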

    d. Inference (Monitoring Part)
    • Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.
    • Figure 8 in the paper is generated using this part of the code.

    6. sequence_analysis.py

    • Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.
    • This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

    Licenses

    The data is licensed under CC-BY; the code is licensed under MIT.

  2. Pre-Processed Power Grid Frequency Time Series

    • data.subak.org
    csv
    Updated Feb 16, 2023
    Cite
    Zenodo (2023). Pre-Processed Power Grid Frequency Time Series [Dataset]. https://data.subak.org/dataset/pre-processed-power-grid-frequency-time-series
    Explore at:
    Available download formats: csv
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Description

    Overview

    This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid:

    • Continental Europe
    • Great Britain
    • Nordic

    This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data or the code. For detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.

    Data sources

    We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).

    • Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows re-publication upon request [3].
    • Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].
    • Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

    Content of the repository

    A) Scripts

    1. In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSO's websites.
    2. In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).
    3. In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).

    The Python scripts run with Python 3.7 and the packages listed in "requirements.txt".

    B) Yearly converted and cleansed data

    The folders "

    • File type: The files are zipped csv-files, where each file comprises one year.
    • Data format: The files contain two columns. The second column contains the frequency values in Hz. The first one represents the time stamps in the format Year-Month-Day Hour-Minute-Second, which is given as naive local time. The local time refers to the following time zones and includes Daylight Saving Times (python time zone in brackets):
      • TransnetBW: Continental European Time (CE)
      • Nationalgrid: Great Britain (GB)
      • Fingrid: Finland (Europe/Helsinki)
    • NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.
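
    Given the two-column layout described above, a single yearly file can be loaded roughly like this (the file name is a placeholder, and the timestamp format string is an assumption based on the description; drop header=None if the files contain a header row):

    import pandas as pd

    freq = pd.read_csv("Nationalgrid_2019.csv", header=None,
                       names=["time", "frequency_hz"], na_values=["NaN"])
    # Timestamps are naive local time in Year-Month-Day Hour-Minute-Second format;
    # adjust the format string if the actual separators differ.
    freq["time"] = pd.to_datetime(freq["time"], format="%Y-%m-%d %H-%M-%S", errors="coerce")
    print(freq["frequency_hz"].isna().mean())  # share of missing/corrupted samples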

    Use cases

    We point out that this repository can be used in two different ways:

    • Use pre-processed data: You can directly use the converted or the cleansed data. Note, however, that both data sets include segments of NaN values due to missing and corrupted recordings. Only a very small share of the NaN values was eliminated in the cleansed data, in order not to manipulate the data too much.

    • Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "

    License

    This work is licensed under multiple licenses, which are located in the "LICENSES" folder.

    • We release the code in the folder "Scripts" under the MIT license.
    • The pre-processed data in the subfolders "**/Fingrid" and "**/Nationalgrid" are licensed under CC-BY 4.0.
    • TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.

    Changelog

    Version 2:

    • Add time zone information to description
    • Include new frequency data
    • Update references
    • Change folder structure to yearly folders

    Version 3:

    • Correct TransnetBW files for missing data in May 2016
  3. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • explore.openaire.eu
    Updated Apr 26, 2021
    Cite
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044635
    Explore at:
    Dataset updated
    Apr 26, 2021
    Authors
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
    Description

    The dataset was gathered on Sep. 17th, 2020 from GitHub. It has more than 5.2K Python repositories and 4.2M type annotations. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.

  4. MME-only models trained with clean data for JAMES paper "Machine-learned...

    • zenodo.org
    tar
    Updated Nov 8, 2023
    Cite
    Ryan Lagerquist (2023). MME-only models trained with clean data for JAMES paper "Machine-learned uncertainty quantification is not magic" [Dataset]. http://doi.org/10.5281/zenodo.10084394
    Explore at:
    Available download formats: tar
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ryan Lagerquist
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This tar file contains all 100 trained models in the MME-only ensemble from Experiment 1 (i.e., those trained with clean data, not with lightly perturbed data). To read one of the models into Python, you can use the method neural_net.read_model in the ml4rt library.
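
    A minimal sketch of that read step, assuming the tar file has been extracted and that neural_net.read_model accepts a path to a single model file (the import path and signature may differ between ml4rt versions):

    from ml4rt import neural_net   # module location is an assumption

    model_file_name = "model_001.h5"                       # hypothetical file inside the archive
    model_object = neural_net.read_model(model_file_name)  # method named in the description above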

  5. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 27, 2022
    Cite
    Nagappan, Meiyappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
    Explore at:
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    Keshavarz, Hossein
    Nagappan, Meiyappan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

    The datasets are available under directory dataset. There are 4 datasets in this directory.

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
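
    For orientation only (not part of the replication package; apart from the buggy label, the feature column names are assumptions), the splits can be loaded and fed to a simple classifier like so:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    train = pd.read_csv("dataset/apachejit_train.csv")
    test = pd.read_csv("dataset/apachejit_test_small.csv")

    # 'buggy' is the label column described above; every other numeric column is
    # treated here as a commit metric (the paper defines the actual metric set).
    feature_cols = [c for c in train.columns
                    if c != "buggy" and pd.api.types.is_numeric_dtype(train[c])]

    clf = LogisticRegression(max_iter=1000)
    clf.fit(train[feature_cols], train["buggy"])
    print("Accuracy on the small test set:", clf.score(test[feature_cols], test["buggy"]))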

    In addition to the dataset, we also provide the scripts using which we built the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

    The scripts are comprised of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via GitHub search API and collecting commits through PyDriller Package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates then are filtered again using gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles GitHub API token that is necessary for requests to GitHub API. Script collector.py performs GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

    References:

    1. GumTree

    Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden, September 15-19, 2014. 313-324.

    2. PyDriller
    • https://pydriller.readthedocs.io/en/latest/

    • Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering(Lake Buena Vista, FL, USA)(ESEC/FSE2018). Association for Computing Machinery, New York, NY, USA, 908–911

  6. Dataset for Cyberbullying Severe Classification

    • data.mendeley.com
    Updated Dec 20, 2022
    Cite
    Idi Mohammed (2022). Dataset for Cyberbullying Severe Classification [Dataset]. http://doi.org/10.17632/4w6fcyzdfp.1
    Explore at:
    Dataset updated
    Dec 20, 2022
    Authors
    Idi Mohammed
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We retrieved 7,547 tweets from the comment sections of Nigeria's most influential Twitter handles between July 12, 2022 and September 22, 2022. The acquired comments were then cleaned, keeping only the text (removing user handles, URLs, emotional signs, etc.), and filtered using Python to remove duplicated comments. The dataset consists of severity classifications.
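
    A hedged sketch of this kind of cleaning (the authors' exact rules and column names are not published here): strip user handles, URLs, and non-text symbols, then drop duplicates.

    import re
    import pandas as pd

    def clean_tweet(text: str) -> str:
        text = re.sub(r"@\w+", "", text)                   # user handles
        text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
        text = re.sub(r"[^\w\s.,!?'-]", "", text)          # emoji and other symbols
        return re.sub(r"\s+", " ", text).strip()

    # df = pd.read_csv("cyberbullying_comments.csv")                # hypothetical file name
    # df["clean_text"] = df["text"].astype(str).apply(clean_tweet)  # 'text' column is assumed
    # df = df.drop_duplicates(subset="clean_text")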

  7. Office Light Level

    • borealisdata.ca
    Updated Dec 7, 2020
    Cite
    Y. Aussat; Costin Ograda-Bratu; S. Huo; S. Keshav (2020). Office Light Level [Dataset]. http://doi.org/10.5683/SP2/3BCADM
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 7, 2020
    Dataset provided by
    Borealis
    Authors
    Y. Aussat; Costin Ograda-Bratu; S. Huo; S. Keshav
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Nov 16, 2020
    Description

    Data Owner: Y. Aussat, S. Keshav
    Data File: 32.8 MB zip file containing the data files and description
    Data Description: This dataset contains daylight signals collected over approximately 200 days in four unoccupied offices in the Davis Center building at the University of Waterloo. Thus, these measure the available daylight in the room. Light levels were measured using custom-built light sensing modules based on the Omega Onion microcomputer with a light sensor. An example of the module is shown in the file sensing-module.png in this directory. Each sensing module is named using four hex digits. We started all modules on August 30, 2018, which corresponds to minute 0 in the dataset. However, the modules were not deployed immediately. Below are the times when we started collecting the light data in each office and the corresponding sensing module names:
    • DC3526: devices af65, b02d; started September 6, 2018, 11:00 am
    • DC2518: device afa7; started September 6, 2018, 11:00 am
    • DC2319: devices af67, f073; started September 21, 2018, 11:00 am
    • DC3502: devices afa5, b969; started September 21, 2018, 11:00 am
    Moreover, due to some technical problems, the initial 6 days for offices 1 and 2 and the initial 21 days for offices 3 and 4 are dummy data and should be ignored. Finally, there were two known outages in DC during the data collection process:
    • from 00:00 AM to 4:00 AM on September 17, 2018
    • from 11:00 pm on October 9, 2018 until 7:45 am on October 10, 2018
    We stopped collecting the data around 2:45 pm on May 16, 2019. Therefore, we have 217 uninterrupted days of clean collected data from October 11, 2018 to May 15, 2019. To take care of these problems, we have provided a Python script, process-lighting-data.ipynb, that extracts clean data from the raw data. Both raw and processed data are provided as described next.
    Raw data: Raw data folder names correspond to the device names. The light sensing modules log (minute_count, visible_light, IR_light) every minute to a file. Here, minute 0 corresponds to August 30, 2018. Every 1440 minutes (i.e., 1 day) we saved the current file, created a new one, and started writing to it. The filename format is {device_name}_{starting_minute}. For example, Omega-AF65_28800.csv is data collected by Omega-AF65, starting at minute 28800. A metadata file can also be found in each folder with the details of the log file structure.
    Processed data: The folder named ‘processed_data’ contains the processed data, which results from running the Python script. Each file in this directory is named after the device ID; for example, af65.csv stores the processed data of the device Omega-AF65. The columns in this file are:
    • Minutes: consecutive minute of the experiment
    • Illum: illumination level (lux)
    • Min_from_midnight: minutes from midnight of the current day
    • Day_of_exp: count of the day number starting from October 11, 2018
    • Day_of_year: day of the year
    Funding: The Natural Sciences and Engineering Research Council of Canada (NSERC)
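
    Because minute 0 corresponds to August 30, 2018, the minute counters in the processed files can be turned into timestamps roughly as follows (the start hour is not stated in the description, so midnight is assumed here purely for illustration):

    import pandas as pd

    df = pd.read_csv("processed_data/af65.csv")        # device file named in the description
    start = pd.Timestamp("2018-08-30 00:00:00")        # assumed start-of-day reference
    df["datetime"] = start + pd.to_timedelta(df["Minutes"], unit="m")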

  8. Data to Estimate Water Use Associated with Oil and Gas Development within...

    • s.cnmilf.com
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Data to Estimate Water Use Associated with Oil and Gas Development within the Bureau of Land Management Carlsbad Field Office Area, New Mexico [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/data-to-estimate-water-use-associated-with-oil-and-gas-development-within-the-bureau-of-la
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Carlsbad, New Mexico
    Description

    The purpose of this data release is to provide data in support of the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water-use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. Three comma separated value files and two python scripts are included in this data release. It was determined that all reported oil and gas wells within Chaves County from the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were outside of the CFO administration area and were excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in the produced_water.csv file to be consistent with the BLM’s water support document. Data were synthesized into comma separated values which include, produced_water.csv (volume) from NM OCD, well_records.csv (including _location and completion) from NM OCD and FracFocus, and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results from modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Additionally, python scripts to process, clean, and categorize FracFocus data are provided in this data release.

  9. Python ETL Update Test Dataset

    • data.cambridgema.gov
    application/rdfxml +5
    Updated Mar 18, 2025
    Cite
    (2025). Python ETL Update Test Dataset [Dataset]. https://data.cambridgema.gov/dataset/Python-ETL-Update-Test-Dataset/tqxn-z38b
    Explore at:
    Available download formats: application/rssxml, application/rdfxml, tsv, csv, xml, json
    Dataset updated
    Mar 18, 2025
    Description

    Script we use to test the python ETL update process on milo. Keep it private, but please do not delete.

  10. Spatialized sorghum & millet yields in West Africa, derived from LSMS-ISA...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 7, 2024
    Cite
    Lavarenne, Jérémy (2024). Spatialized sorghum & millet yields in West Africa, derived from LSMS-ISA and RHoMIS datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10556265
    Explore at:
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Lavarenne, Jérémy
    Baboz, Eliott
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    West Africa, Africa
    Description

    Description: The dataset represents a significant effort to compile and clean a comprehensive set of seasonal yield data for sub-Saharan West Africa (Benin, Burkina Faso, Mali, Niger). This dataset, covering more than 22,000 survey answers scattered across more than 2,500 unique locations of smallholder producers’ household groups, is instrumental for researchers and policymakers working in agricultural planning and food security in the region. It integrates data from two sources, the LSMS-ISA program (link to the World Bank's site) and the RHoMIS dataset (link to RHoMIS files, RHoMIS' DOI).

    The construction of the dataset involved meticulous processes, including converting production into standardized units, calculating yields for each dataset, standardizing column names, assembling the data, and extensive data cleaning, making it a hopefully robust and reliable resource for understanding spatial yield distribution in the region.

    Data Sources: The dataset comprises seven spatialized yield data sources, six of which are from the LSMS-ISA program (Mali 2014, Mali 2017, Mali 2018, Benin 2018, Burkina Faso 2018, Niger 2018) and one from the RHoMIS study (only Mali 2017 and Burkina Faso 2018 data selected).

    Dataset Preparation Methods: The preparation involved the integration of machine-readable files, data cleaning and finalization using Python/Jupyter Notebook. This process should ensure the accuracy and consistency of the dataset. Yields have been calculated from declared production quantities and GPS-measured plot areas. Each yield value corresponds to a single plot.
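
    Conceptually, each yield value is declared production divided by the GPS-measured plot area; as a toy illustration (column names are hypothetical, not the dataset's actual headers):

    import pandas as pd

    plots = pd.DataFrame({
        "production_kg": [850.0, 420.0],   # declared production, converted to kg
        "plot_area_ha": [1.2, 0.5],        # GPS-measured plot area in hectares
    })
    plots["yield_kg_per_ha"] = plots["production_kg"] / plots["plot_area_ha"]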

    Discussion: This dataset, with its extensive data compilation, presents an invaluable resource for agricultural productivity-related studies in West Africa. However, users must navigate its complexities, including potential biases due to survey and due to UML units, and data inconsistencies. The dataset's comprehensive nature requires careful handling and validation in research applications.

    Authors Contributions:

    Data treatment: Eliott Baboz, Jérémy Lavarenne.

    Documentation: Jérémy Lavarenne.

    Funding: This project was funded by the INTEN-SAHEL TOSCA project (Centre national d’études spatiales). "123456789" was chosen randomly and is not the actual award number because there is none, but it was mandatory to put one here on Zenodo.

    Changelog:

    v1.0.0 : initial submission

  11. GLOBE Observer Mosquito Habitat Mapper Citizen Science Data 2017-2020, v1

    • hub.arcgis.com
    Updated Apr 2, 2021
    Cite
    Institute for Global Environmental Strategies (2021). GLOBE Observer Mosquito Habitat Mapper Citizen Science Data 2017-2020, v1 [Dataset]. https://hub.arcgis.com/documents/IGEStrategies::globe-observer-mosquito-habitat-mapper-citizen-science-data-2017-2020-v1
    Explore at:
    Dataset updated
    Apr 2, 2021
    Dataset authored and provided by
    Institute for Global Environmental Strategies
    Description

    Three Cases: Metadata and Procedures
    The data sets described here were used in an article submitted to the journal GeoHealth in 2021. The data files and further supplemental links (including general information about GLOBE data) can be accessed at https://observer.globe.gov/get-data/mosquito-habitat-data.
    Case 1: Removal of records with suspect geolocation data. A Python script was applied to remove records where the measured position (in decimal degrees) was identical to the GLOBE MGRS site position. GPS-obtained latitude and longitude coordinates are reported in decimal degrees, so records identified by whole numbers were also removed. This procedure removed 5704 (23%) of the 24983 records in the Mosquito Habitat Mapper database, with 19,279 records remaining. The secondary data sets cleaned only for geolocation anomalies were labeled Case 1.
    Case 2: Identifying suspected training events. For this test, we sought to identify groups of data that exceeded 10 records sharing these characteristics. Another Python script was employed to extract the photos for ease of visual inspection. Because we needed to manually review the photo records, we set the threshold for groups at >10, so that the analysis could be completed in the time allotted. Groups identified through this procedure were outputted as Case 2: groups. The resulting data set cleaned of groups >10 was labeled Case 2. The resulting data set included 20,006 records and identified 2,447 records found in clusters we postulated were training events.
    Case 3: The Case 3 secondary dataset results from applying the Python scripts used to create Cases 1 and 2. We used the Case 3 data sets, with improved geolocation and large groups eliminated, in the following analysis.
    The information in this description was last updated 2021-04-12.
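
    A hedged reconstruction of the Case 1 filter described above (column names are assumptions, not the actual GLOBE field names): drop records whose measured coordinates equal the MGRS site coordinates, and records reported as whole-number degrees.

    import pandas as pd

    df = pd.read_csv("mosquito_habitat_mapper.csv")    # hypothetical export of the MHM database

    same_as_site = (df["measured_lat"] == df["mgrs_lat"]) & (df["measured_lon"] == df["mgrs_lon"])
    whole_degrees = (df["measured_lat"] % 1 == 0) & (df["measured_lon"] % 1 == 0)

    case1 = df[~(same_as_site | whole_degrees)]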

  12. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 24, 2022
    Cite
    Lignos, Dimitrios G. (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6965146
    Explore at:
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Hartloper, Alexander R.
    de Castro e Sousa, Albano
    Lignos, Dimitrios G.
    Ozden, Selimcan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    The data is licensed through the Creative Commons Attribution 4.0 International.

    If you have used our data and are publishing your work, we ask that you please reference both:

    this database through its DOI, and

    any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.

    Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.

    Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data

    Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.

    We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory similar to the "Clean_Data"

    The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

    There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.

    The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

    Clean_Data_v1-0-0.zip: contains all the downsampled data

    The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

    There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.

    The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

    Database_References_v1-0-0.bib

    Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_

    The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data

    Time[s]: time in seconds since the start of the test

    e_true: true strain

    Sigma_true: true stress in MPa

    (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_

    The first column is the index of each data point

    S/No: sample number recorded by the DAQ

    System Date: Date and time of sample

    Time[s]: time in seconds since the start of the test

    C_1_Force[kN]: load cell force

    C_1_Déform1[mm]: extensometer displacement

    C_1_Déplacement[mm]: cross-head displacement

    Eng_Stress[MPa]: engineering stress

    Eng_Strain[]: engineering strain

    e_true: true strain

    Sigma_true: true stress in MPa

    (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    hidden_index: internal reference ID

    grade: material grade

    spec: specifications for the material

    source: base material for the test specimen

    id: internal name for the specimen

    lp: load protocol

    size: type of specimen (M8, M12, M20)

    gage_length_mm_: unreduced section length in mm

    avg_reduced_dia_mm_: average measured diameter for the reduced section in mm

    avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm

    avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm

    fy_n_mpa_: nominal yield stress

    fu_n_mpa_: nominal ultimate stress

    t_a_deg_c_: ambient temperature in degC

    date: date of test

    investigator: person(s) who conducted the test

    location: laboratory where test was conducted

    machine: setup used to conduct test

    pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control

    pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control

    pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control

    citekey: reference corresponding to the Database_References.bib file

    yield_stress_mpa_: computed yield stress in MPa

    elastic_modulus_mpa_: computed elastic modulus in MPa

    fracture_strain: computed average true strain across the fracture surface

    c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass

    file: file name of corresponding clean (downsampled) stress-strain data

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    import pandas as pd
    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
                       index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
                       keep_default_na=False, na_values='')

    citekey: reference in "Campaign_References.bib".

    Grade: material grade.

    Spec.: specifications (e.g., J2+N).

    Yield Stress [MPa]: initial yield stress in MPa

    size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Elastic Modulus [MPa]: initial elastic modulus in MPa

    size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:

    A500

    A992_Gr50

    BCP325

    BCR295

    HYP400

    S460NL

    S690QL/25mm

    S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm

  13. #PraCegoVer dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 19, 2023
    Cite
    Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
    Explore at:
    Dataset updated
    Jan 19, 2023
    Dataset provided by
    Sandra Avila
    Gabriel Oliveira dos Santos
    Esther Luna Colombini
    Description

    Automatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

    PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

    PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

    Dataset Structure

    PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json comprises a list of JSON objects with the attributes:

    user: anonymized user that made the post;

    filename: image file name;

    raw_caption: raw caption;

    caption: clean caption;

    date: post date.

    Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.
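
    Based on the attribute list above, captions can be paired with their image files roughly like this (paths are placeholders):

    import json
    import os

    with open("dataset.json", encoding="utf-8") as f:
        records = json.load(f)                 # list of objects with the attributes above

    images_dir = "images"                      # produced by extracting images.tar.gz
    for rec in records[:5]:
        print(os.path.join(images_dir, rec["filename"]), "->", rec["caption"][:60])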

    Download Instructions

    If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

    cat images.tar.gz.part* > images.tar.gz
    tar -xzvf images.tar.gz

    Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

    python download_dataset.py --access_token=

  14. Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Oct 24, 2023
    Cite
    (2023). Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/63d20a5e-3584-5096-a34d-d3f93fcc8857
    Explore at:
    Dataset updated
    Oct 24, 2023
    Description

    Using Machine Learning techniques in general and Deep Learning techniques in particular needs a certain amount of data, often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust Machine Learning based classification and wear prognostics models, there is a need for real-world datasets to train and test models on. The dataset contains 1104 channel 3 images with 394 image-annotations for the surface damage type “pitting”. The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into the folders data (all images as JPEG), label (all annotations), and saved_model (a baseline model). The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits images into the same train and test data-split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types. One of the two mentioned BSD types is represented with 69 images and 55 different image sizes. All images with this BSD type come either in a clean or soiled condition. The other BSD type is shown on 325 images with two image sizes. Since all images of this type have been taken continuously over time, the degree of soiling is evolving. Also, the dataset contains, as mentioned above, 27 pitting development sequences with 69 images each.
    Instructions for dataset splitting: The authors of this dataset provide 3 types of dataset splits. To get a data split you have to run the Python script split_dataset.py.
    Script inputs:
    • split-type (mandatory)
    • output directory (mandatory)
    Split types:
    • train_test_split: splits the dataset into train and test data (80%/20%)
    • wear_dev_split: splits the dataset into 27 wear developments
    • type_split: splits the dataset into the different BSD types
    Example: C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
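
    Since the annotations are labelme JSON files, the "pitting" polygons of a single image can be inspected with a few lines (the file name is a placeholder; the standard labelme fields "shapes", "label" and "points" are assumed):

    import json

    with open("label/example_image.json", encoding="utf-8") as f:   # placeholder path
        annotation = json.load(f)

    pitting = [s for s in annotation.get("shapes", []) if s.get("label") == "pitting"]
    print(len(pitting), "pitting annotations in this image")
    if pitting:
        print("first polygon points:", pitting[0]["points"])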

  15. Electrification of Heat Demonstration Project: Heat Pump Performance...

    • datacatalogue.cessda.eu
    • beta.ukdataservice.ac.uk
    Updated Dec 20, 2024
    Cite
    Energy Systems Catapult (2024). Electrification of Heat Demonstration Project: Heat Pump Performance Cleansed Data, 2020-2023 [Dataset]. http://doi.org/10.5255/UKDA-SN-9050-2
    Explore at:
    Dataset updated
    Dec 20, 2024
    Authors
    Energy Systems Catapult
    Time period covered
    Nov 1, 2020 - Sep 29, 2023
    Area covered
    United Kingdom
    Variables measured
    Families/households, Subnational
    Measurement technique
    Measurements and tests
    Description

    Abstract copyright UK Data Service and data collection copyright owner.


    The heat pump monitoring datasets are a key output of the Electrification of Heat Demonstration (EoH) project, a government-funded heat pump trial assessing the feasibility of heat pumps across the UK’s diverse housing stock. These datasets are provided in both cleansed and raw form and allow analysis of the initial performance of the heat pumps installed in the trial. From the datasets, insights such as heat pump seasonal performance factor (a measure of the heat pump's efficiency), heat pump performance during the coldest day of the year, and half-hourly performance to inform peak demand can be gleaned.

    For the second edition (December 2024), the data were updated to include cleaned performance data collected between November 2020 and September 2023. The only documentation currently available with the study is the Excel data dictionary. Reports and other contextual information can be found on the Energy Systems Catapult website.

    The EoH project was funded by the Department of Business, Energy and Industrial Strategy. From 2023, it is covered by the new Department for Energy Security and Net Zero.

    Data availability

    This study comprises the open-access cleansed data from the EoH project and a summary dataset, available in four zipped files (see the 'Access Data' tab). Users must download all four zip files to obtain the full set of cleansed data and accompanying documentation.

    When unzipped, the full cleansed data comprises 742 CSV files. Most of the individual CSV files are too large to open in Excel. Users should ensure they have sufficient computing facilities to analyse the data.
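
    Because most of the 742 CSV files are too large for Excel, one practical option is to stream them in chunks with pandas (the file name is a placeholder):

    import pandas as pd

    # Read a large performance file in manageable pieces instead of loading it at once.
    chunks = pd.read_csv("cleansed_data/example_heat_pump_file.csv", chunksize=500_000)
    total_rows = sum(len(chunk) for chunk in chunks)
    print("rows:", total_rows)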

    The UKDS also holds an accompanying study, SN 9049 Electrification of Heat Demonstration Project: Heat Pump Performance Raw Data, 2020-2023, which is available only to registered UKDS users. This contains the raw data from the EoH project. Since the data are very large, only the summary dataset is available to download; an order must be placed for FTP delivery of the remaining raw data. Other studies in the set include SN 9209, which comprises 30-minute interval heat pump performance data, and SN 9210, which includes daily heat pump performance data.

    The Python code used to cleanse the raw data and then perform the analysis is accessible via the Energy Systems Catapult GitHub.


    Main Topics:

    Heat Pump Performance across the BEIS funded heat pump trial, The Electrification of Heat (EoH) Demonstration Project. See the documentation for data contents.

  16. wikipedia

    • tensorflow.org
    • huggingface.co
    Updated Aug 9, 2019
    Cite
    (2019). wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
    Explore at:
    Dataset updated
    Aug 9, 2019
    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  17. Data from: The MTPy software package for magnetotelluric data analysis and...

    • data.gov.au
    html
    Updated Jul 15, 2021
    Cite
    Commonwealth of Australia (Geoscience Australia) (2021). The MTPy software package for magnetotelluric data analysis and visualisation [Dataset]. https://data.gov.au/dataset/ds-ga-b3b308de-7126-4ef5-ba12-b1da470be63d
    Explore at:
    Available download formats: html
    Dataset updated
    Jul 15, 2021
    Dataset provided by
    Commonwealth of Australia (Geoscience Australia)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The magnetotelluric (MT) method is increasingly being applied to a wide variety of geoscience problems. However, the software available for MT data analysis and interpretation is still very limited in comparison to many of the more mature geophysical methods such as the gravity, magnetic or seismic reflection methods. MTPy is an open source Python package to assist with MT data processing, analysis, modelling, visualization and interpretation. It was initiated at the University of Adelaide in 2013 as a means to store and share Python code amongst the MT community (Krieger and Peacock 2014). Here we provide an overview of the software and describe recent developments to MTPy. These include new functionality and a clean-up and standardisation of the source code, as well as the addition of an integrated testing suite, documentation, and examples in order to facilitate the use of MT in the wider geophysics community.

  18. Data from: Assessing the Impacts of Pythons in the Greater Everglades:...

    • datadiscoverystudio.org
    • search.dataone.org
    • +2more
    Updated May 20, 2018
    Cite
    (2018). Assessing the Impacts of Pythons in the Greater Everglades: Examination of Diet and Thermal Biology of Python molurus bivittatus. [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/2a25e7bfa1224bcea8935ede5f234d02/html
    Explore at:
    Dataset updated
    May 20, 2018
    Area covered
    Everglades
    Description

    description: The Burmese python (Python molurus bivittatus), a native to Southeast Asia, can reach a length greater than twenty feet (Wall 1921, Pope 1961). This python is a long lived (15 - 25 years) behavioral, habitat, and dietary generalist, capable of producing large clutches of eggs (8 - 107) (Lederer 1956, Branch and Erasmus 1984). Observations of Burmese pythons exist in the United States primarily from locations within Everglades National Park (ENP), including; along the Main Park Road in the saline and freshwater glades, and mangroves, between Pay-hay-okee and Flamingo, the greater Long Pine Key area (including Hole-in-the-Donut), and the greater Shark Valley area along the Tamiami Trail (including L-67 Ext.). The non-native species has also been observed repeatedly on the eastern boundary of ENP, along canal levees, in the remote mangrove backcountry, and in Big Cypress National Preserve. From 2002 (when the numbers first began to climb) to 2005, 201 pythons were captured and removed or found dead. In 2006-2007 alone, that number more than doubled to 418. Measured total length for snakes recovered ranged from 0.5 m to 4.5 m including five hatchling-sized animals recovered in the summer of 2004, and two hatchlings captured in 2005. In 2008, 343 pythons were removed, and so far in 2009, 347 individuals have been removed. The non-native semi-aquatic pythons's diet in southern Florida includes raccoon, rabbit, muskrat, squirrel, opossum, cotton rat, black rat, bobcat, house wren, pied-billed grebe, white ibis, limpkin, alligator and endangered Key Largo wood rat. As Python molurus is known to eat birds, and also known to frequent wading bird colonies in their native range, the proximity of python sightings to the Paurotis Pond and Tamiami West wood stork rookeries is troubling. The potential for pythons to eat Mangrove Fox Squirrels and Cape Sable Seaside Sparrows and to compete with Indigos Snakes is also of concern. Burmese Pythons present a potential threat to successful ecological restoration of the greater Everglades (NRC 2005). Pythons are now established and breeding in South Florida. Python molurus bivittatus has the potential to occupy the entire footprint of the Comprehensive Everglades Restoration Project (CERP), adversely impacting valued resources across the landscape. Proposed management and control actions must include research strategies and further evaluation of potential impacts of pythons. The results of this project will be applied to develop a comprehensive, science-based control and containment program. The proposed project will also increase our understanding of the impacts of Burmese pythons on native fauna in DOI and surrounding lands. Dealing with established exotic species requires that we understand their status and impacts, and how to remove them. A current priority item for determining status is finding out the extent of invasion by established species. Once we know where the threat is occurring, we need a better understanding of how the threat may manifest itself ecologically-that is, what are the impacts of invasion? We can hypothesize that Burmese pythons compete with native snakes or affect populations of prey species; however, knowing with certainty that pythons eat wood rats, for example, better focuses eradication efforts and spurs action. A study of diet of Burmese pythons directly addresses this issue. 
Further, knowing how much pythons eat through a bioenergetic model allows us to forecast with more certainty predation impacts on native fauna.

  19. U.S. Historical Climate - Monthly Averages for GHCN-D Stations for 1981 -...

    • climate-arcgis-content.hub.arcgis.com
    • community-climatesolutions.hub.arcgis.com
    Updated Apr 16, 2019
    Cite
    Esri (2019). U.S. Historical Climate - Monthly Averages for GHCN-D Stations for 1981 - 2010 [Dataset]. https://climate-arcgis-content.hub.arcgis.com/datasets/esri::u-s-historical-climate-monthly-averages-for-ghcn-d-stations-for-1981-2010
    Explore at:
    Dataset updated
    Apr 16, 2019
    Dataset authored and provided by
    Esri (http://esri.com/)
    Area covered
    Pacific Ocean, North Pacific Ocean
    Description

    This point layer contains monthly summaries of daily temperatures (means, minimums, and maximums) and precipitation levels (sum, lowest, and highest) for the period January 1981 through December 2010 for weather stations in the Global Historical Climate Network Daily (GHCND). Data in this service were obtained from web services hosted by the Applied Climate Information System (ACIS). ACIS staff curate the values for the U.S., including correcting erroneous values, reconciling data from stations that have been moved over their history, etc. The data were compiled at Esri from publicly available sources hosted and administered by NOAA. Because the ACIS data are updated and corrected on an ongoing basis, the date of collection for this layer was Jan 23, 2019. The following process was used to produce this dataset:
    1. Download the most current list of stations from ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt. Import this into Microsoft Excel and save as CSV.
    2. In ArcGIS, import the CSV as a geodatabase table and use the XY Event Layer tool to locate each point. Using a detailed U.S. boundary, extract the points that fall within the 50 U.S. states, the District of Columbia, and Puerto Rico.
    3. Using Python with DA.UpdateCursor and urllib2, access the ACIS Web Services API to determine whether each station had at least 50 monthly values of temperature data; delete the other stations. (An illustrative sketch of this screening step follows below.)
    4. Using Python, add the necessary field names and acquire all monthly values for the remaining stations. Thus, there are stations that have some missing data.
    5. Using Python, add fields and convert the standard values to metric values so both would be present. Thus, there are four sets of monthly data in this dataset:
       a. Monthly means, mins, and maxes of daily temperatures, in degrees Fahrenheit.
       b. Monthly mean of monthly sums of precipitation, and the lowest and highest precipitation levels during the period 1981 to 2010, in mm.
       c. The temperatures in (a), in degrees Celsius.
       d. The precipitation levels in (b), in inches.
    After initially publishing these data in a different service, it was learned that more precise coordinates for station locations were available from the Enhanced Master Station History Report (EMSHR) published by NOAA NCDC. With the publication of this layer these more precise coordinates are used. A large subset of the EMSHR metadata is available via EMSHR Stations Locations and Metadata 1738 to Present. If your study area includes areas outside of the U.S., use the World Historical Climate - Monthly Averages for GHCN-D Stations 1981 - 2010 layer. The data in that layer come from the same source archive; however, they are not curated by the ACIS staff and may contain errors.
    Revision History:
    - Initially published: 23 Jan 2019.
    - Updated 16 Apr 2019: More precise coordinates for station locations became available from the Enhanced Master Station History Report (EMSHR) published by NOAA NCDC. With the publication of this layer, the geometry and attributes for 3,222 of 9,636 stations now have more precise coordinates. The schema was updated to include the NCDC station identifier, and elevation fields in feet and meters are also included. A large subset of the EMSHR data is available via EMSHR Stations Locations and Metadata 1738 to Present.
    Cite as: Esri, 2019: U.S. Historical Climate - Monthly Averages for GHCN-D Stations for 1981 - 2010. ArcGIS Online, Accessed
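
    The station-screening step in item 3 can be illustrated with a short, hedged sketch. This is not the original arcpy/urllib2 script; it assumes the publicly documented ACIS StnData endpoint and its JSON request format, uses only the Python standard library, and the station identifier is illustrative.

      # Hedged sketch: count how many monthly mean-temperature values a GHCN-D
      # station has for 1981-2010 and apply the 50-value screening rule.
      # Endpoint and parameter names follow the public ACIS documentation.
      import json
      import urllib.request

      ACIS_STNDATA_URL = "https://data.rcc-acis.org/StnData"

      def monthly_temperature_count(station_id):
          """Return the number of non-missing monthly mean temperatures, 1981-2010."""
          params = {
              "sid": station_id,
              "sdate": "1981-01",
              "edate": "2010-12",
              "elems": [{"name": "avgt", "interval": "mly",
                         "duration": "mly", "reduce": "mean"}],
          }
          request = urllib.request.Request(
              ACIS_STNDATA_URL,
              data=json.dumps(params).encode("utf-8"),
              headers={"Content-Type": "application/json"},
          )
          with urllib.request.urlopen(request) as response:
              payload = json.load(response)
          # Each data row is [date, value]; ACIS marks missing values with "M".
          return sum(1 for _, value in payload.get("data", []) if value != "M")

      if __name__ == "__main__":
          station = "USW00094728"  # illustrative GHCN-D station identifier
          count = monthly_temperature_count(station)
          print(station, "kept" if count >= 50 else "dropped", f"({count} monthly values)")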

  20. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    bin, application/gzip, zip, text/x-python
    Available download formats
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semi-structured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2 TB of disk space
    - at least 16 GB of RAM (64 GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       cd ghd
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file (a minimal sketch of these two entries follows this list):
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt`.
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/__init__.py`, comment out everything except GitHub support
     in `PROVIDERS`.
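
    A minimal sketch of the two `settings.py` entries mentioned above. The values below are
    placeholders, and the exact structure may differ from the real settings template:

      # settings.py (sketch) - only the two entries discussed in this guide.
      DATASET_PATH = "/data/ghd_dataset"  # newly created folder for dataset files
      SCRAPER_GITHUB_API_TOKENS = [
          "ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # at least one GitHub API token
      ]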
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
Cite
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92

Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment

Related Article
Explore at:
html, zip, pdf
Available download formats
Dataset updated
Feb 23, 2025
Dataset provided by
TU Wien
Authors
Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Folder Structure

The folder named “submission” contains the following:

  1. “pythonProject”: This folder contains all the Python files and subfolders needed for analysis.
  2. ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

Setting Up the Environment

  1. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.
  2. The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

Subfolders

1. Data_4_IJGIS

  • This folder contains the data used for the results reported in the paper.
  • Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

2. results_[DateTime] (e.g., results_20240906_15_00_13)

  • This folder will be generated when you run the code and will store the output of each step.
  • The current folder contains results created during code debugging for the submission.
  • When you run the code, a new folder with fresh results will be generated.

Python Files

1. helper_functions.py

  • Contains reusable functions used throughout the analysis.
  • Each function includes a description of its purpose and the input parameters required.

2. create_sanity_plots.py

  • Generates scatter plots like those in Figure 3 of the paper.
  • Although the code has been run for all 309 trials, it can be used to check the sample data provided.
  • Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
  • Usage: Run this file to create visualizations similar to Figure 3.

3. overlapping_sliding_window_loop.py

  • Implements overlapping sliding window segmentation and generates plots like those in Figure 4.
  • Output:
    • Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.
    • Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.
    • A visualization of the segments, similar to Figure 4, will be automatically generated.
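
For readers who want to see what this segmentation step does before running the script, the following is a generic, illustrative sketch of overlapping sliding-window segmentation over a time-indexed recording (2–10 s windows with a 1 s step, matching the defaults above). It is not the project's overlapping_sliding_window_loop.py, and the column names and demo data are placeholders.

    # Generic overlapping sliding-window segmentation (illustrative sketch only).
    import numpy as np
    import pandas as pd

    def sliding_window_segments(df, window_s, step_s=1.0, time_col="timestamp"):
        """Yield (start_time, segment) for overlapping windows of window_s seconds."""
        start, t_end = df[time_col].min(), df[time_col].max()
        while start + window_s <= t_end:
            mask = (df[time_col] >= start) & (df[time_col] < start + window_s)
            yield start, df.loc[mask]
            start += step_s

    # Tiny synthetic demo: a 30-second recording sampled at 10 Hz.
    demo = pd.DataFrame({"timestamp": np.arange(0, 30, 0.1),
                         "x": np.random.rand(300)})
    for window_s in range(2, 11):                 # 2 s ... 10 s windows, 1 s step
        segments = list(sliding_window_segments(demo, window_s))
        print(f"{window_s:2d}-second windows: {len(segments)} segments")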

4. gaze_features.py & imu_features.py

  • These files compute features as explained in Tables 1 and 2 of the paper, respectively.
  • They process the segmented recordings generated by the overlapping_sliding_window_loop.py.
  • Usage: To see how the features are calculated, run these files after the sliding-window segmentation to compute the features from the segmented data.

5. training_prediction.py

  • This file contains the main machine learning analysis of the paper: training the model, evaluating it, and using it for inference on the “monitoring part”. It covers the following steps:
a. Data Preparation (corresponding to Section 5.1.1 of the paper)
  • Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.
  • A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can uncomment this line.
b. Training/Validation/Test Split
  • Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).
  • Make sure that you follow the instructions in the comments to the code exactly.
  • Output: The split data is saved as .csv files in the results folder. (A generic sketch of this split-and-save step is shown below.)
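The following is a generic, hedged sketch of such a split-and-save step. It is not the project's training_prediction.py; the input file name, the "label" column, and the 70/15/15 proportions are illustrative placeholders, and the actual split follows the instructions in the code comments and Section 5.1.1 of the paper.

    # Generic train/validation/test split saved to CSV (illustrative sketch only).
    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("merged_features.csv")                # placeholder input file
    train, rest = train_test_split(data, test_size=0.30, random_state=42,
                                   stratify=data["label"])   # "label" is a placeholder column
    val, test = train_test_split(rest, test_size=0.50, random_state=42,
                                 stratify=rest["label"])

    for name, split in (("train", train), ("validation", val), ("test", test)):
        split.to_csv(f"{name}_split.csv", index=False)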
c. Machine and Deep Learning Experiments

This part contains three main code blocks:

  • MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, uncomment it and comment out the other blocks accordingly.
  • XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.
  • XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6. (A generic, illustrative sketch of tuned XGBoost training follows this list.)
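
For orientation only, here is a minimal, generic sketch of XGBoost training with hyperparameter tuning via scikit-learn's GridSearchCV. It is not the tuned pipeline from training_prediction.py; the file paths, the "label" column, and the parameter grid are illustrative assumptions, and the features and labels are assumed to be numeric/integer-encoded.

    # Generic XGBoost + grid-search sketch (illustrative; not the paper's pipeline).
    import pandas as pd
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    train = pd.read_csv("train_split.csv")        # placeholder paths
    val = pd.read_csv("validation_split.csv")
    X_train, y_train = train.drop(columns=["label"]), train["label"]
    X_val, y_val = val.drop(columns=["label"]), val["label"]

    param_grid = {                                # illustrative grid
        "n_estimators": [100, 300],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.05, 0.1],
    }
    search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                          param_grid, cv=5, scoring="f1_macro", n_jobs=-1)
    search.fit(X_train, y_train)
    print("best parameters:", search.best_params_)
    print("validation accuracy:", search.best_estimator_.score(X_val, y_val))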

Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
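
As a rough illustration of such an empirically chosen confidence threshold (a hedged sketch, not the code in lines 361 to 380; the 0.90 target accuracy and the scan range are invented example values), one could scan candidate thresholds on held-out scores:

    # Illustrative confidence-threshold scan. Assumes a fitted classifier `model`
    # with predict_proba, held-out features X_val, and labels y_val.
    import numpy as np

    def pick_confidence_threshold(model, X_val, y_val, min_accuracy=0.90):
        """Return the lowest threshold at which the kept predictions reach
        the requested accuracy on the validation set (or None if never)."""
        proba = model.predict_proba(X_val)
        confidence = proba.max(axis=1)
        predicted = model.classes_[proba.argmax(axis=1)]
        y_val = np.asarray(y_val)
        for threshold in np.arange(0.50, 1.00, 0.01):
            keep = confidence >= threshold
            if keep.any() and (predicted[keep] == y_val[keep]).mean() >= min_accuracy:
                return round(float(threshold), 2)
        return None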

d. Inference (Monitoring Part)
  • Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.
  • Figure 8 in the paper is generated using this part of the code.

6. sequence_analysis.py

  • Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.
  • This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

Licenses

The data is licensed under CC-BY; the code is licensed under MIT.
