Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The folder named "submission" contains the following:

ijgis.yml: This file lists all the Python libraries and dependencies required to run the code. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.

The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below:

- a .png file for each column of the raw gaze and IMU recordings, color-coded with logged events;
- .csv files;
- overlapping_sliding_window_loop.py: the function plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can uncomment this line;
- .csv files in the results folder.

This part contains three main code blocks:

iii. One for the XGBoost code with correct hyperparameter tuning.

Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2, Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

- a .csv file containing inferred labels.

The data is licensed under CC-BY; the code is licensed under MIT.
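For reference, a label-comparison plot along the lines of the commented-out call above could be sketched with pandas and matplotlib as follows; this is only an illustration of the call signature, not the repository's actual implementation:

import matplotlib.pyplot as plt
import pandas as pd

def plot_labels_comparison(df: pd.DataFrame, save_path: str,
                           x_label_freq: int = 10, figsize=(15, 5)) -> None:
    # Illustrative sketch: plot each (numeric) label column of df and save the figure.
    fig, ax = plt.subplots(figsize=figsize)
    df.plot(ax=ax)
    # Thin out the x-axis ticks so only every x_label_freq-th position is labelled.
    ax.set_xticks(range(0, len(df), x_label_freq))
    fig.savefig(save_path, bbox_inches="tight")
    plt.close(fig)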
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid:
This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data or the code. For detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Content of the repository
A) Scripts
The Python scripts run with Python 3.7 and with the packages found in "requirements.txt".
B) Yearly converted and cleansed data
The folders "
Use cases
We point out that this repository can be used in two different ways:
Use pre-processed data: You can directly use the converted or the cleansed data (a minimal loading sketch is given after this list). Note, however, that both data sets include segments of NaN-values due to missing and corrupted recordings. Only a very small part of the NaN-values was eliminated in the cleansed data, so as not to manipulate the data too much.
Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "
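As a minimal sketch of the first use case, a yearly CSV file can be loaded with pandas and checked for the NaN segments mentioned above; the file name and column layout here are assumptions, so check the repository documentation for the actual naming:

import pandas as pd

# Load one year of cleansed frequency data (file name is hypothetical).
freq = pd.read_csv("cleansed_data/2018_CE.csv", index_col=0, parse_dates=True)

# Quantify the NaN segments left over from missing or corrupted recordings.
nan_fraction = freq.iloc[:, 0].isna().mean()
print(f"Fraction of missing values: {nan_fraction:.4%}")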
License
This work is licensed under multiple licenses, which are located in the "LICENSES" folder.
Changelog
Version 2:
Version 3:
The dataset was gathered on September 17, 2020 from GitHub. It has more than 5.2K Python repositories and 4.2M type annotations. The dataset is de-duplicated using the CD4Py tool. Check out the README.MD file for a description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available in its GitHub repository.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This tar file contains all 100 trained models in the MME-only ensemble from Experiment 1 (i.e., those trained with clean data, not with lightly perturbed data). To read one of the models into Python, you can use the method neural_net.read_model in the ml4rt library.
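For example, one extracted model could be read roughly as follows; this is a sketch, and the archive/file names as well as the exact import path of neural_net are assumptions to verify against the ml4rt documentation:

import tarfile
from ml4rt import neural_net  # adjust the import path to your installed ml4rt version

# Unpack the archive of trained models (archive and file names are hypothetical).
with tarfile.open("mme_only_models.tar") as tar_handle:
    tar_handle.extractall(path="mme_only_models")

# Read one ensemble member with the method mentioned above; check the ml4rt
# documentation for the exact signature of neural_net.read_model.
model_object = neural_net.read_model("mme_only_models/model_0001.h5")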
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.
The datasets are available under directory dataset. There are 4 datasets in this directory.
In addition to the dataset, we also provide the scripts we used to build the dataset. These scripts are written in Python 3.8; therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11; for other languages, external tools are needed. An installation guide and more details can be found here.
The scripts are comprised of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via GitHub search API and collecting commits through PyDriller Package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates then are filtered again using gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.
More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. The script collector.py performs the GitHub search. Tracing changed lines and running git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies four filtering steps (number of lines, number of files, language, and change significance).
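As a rough illustration of the commit-collection step, the sketch below walks a repository's history with PyDriller; it uses the current Repository API (the released scripts may target an older PyDriller version), and the repository path is hypothetical:

from pydriller import Repository

# Walk the commit history of a locally cloned Apache project (path is hypothetical).
for commit in Repository("path/to/apache-project").traverse_commits():
    # Number of modified files and changed lines, in the spirit of the
    # size-based filtering steps described above.
    n_files = len(commit.modified_files)
    print(commit.hash, commit.author_date, n_files, commit.lines)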
References:
Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Västerås, Sweden, September 15-19, 2014. 313-324.
Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908-911.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We retrieved 7,547 tweets from the comment sections of Nigeria's most influential Twitter handles between July 12, 2022 and September 22, 2022. The acquired comments were then cleaned, keeping only the text (removing user handles, URLs, emotional signs, etc.), and filtered to remove duplicated comments using Python. The datasets consist of severe classification
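A cleaning step of this kind could look roughly like the sketch below; the exact rules used for this dataset may differ, and the example tweets are invented:

import re

def clean_tweet(text: str) -> str:
    # Keep only the plain text of a tweet (illustrative; the dataset's actual rules may differ).
    text = re.sub(r"@\w+", "", text)              # drop user handles
    text = re.sub(r"http\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"[^\w\s.,!?']", "", text)      # drop emojis and other symbols
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

tweets = ["@handle1 Nice move! https://t.co/abc 😀", "@handle2 Nice move! https://t.co/abc 😀"]
cleaned = list(dict.fromkeys(clean_tweet(t) for t in tweets))  # de-duplicate, keep order
print(cleaned)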
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data Owner: Y. Aussat, S. Keshav
Data File: 32.8 MB zip file containing the data files and description
Data Description: This dataset contains daylight signals collected over approximately 200 days in four unoccupied offices in the Davis Center building at the University of Waterloo. Thus, these measure the available daylight in the room. Light levels were measured using custom-built light sensing modules based on the Omega Onion microcomputer with a light sensor. An example of the module is shown in the file sensing-module.png in this directory. Each sensing module is named using four hex digits. We started all modules on August 30, 2018, which corresponds to minute 0 in the dataset. However, the modules were not deployed immediately. Below are the times when we started collecting the light data in each office and the corresponding sensing module names.
Office number / Devices / Start time
DC3526: af65, b02d / September 6, 2018, 11:00 am
DC2518: afa7 / September 6, 2018, 11:00 am
DC2319: af67, f073 / September 21, 2018, 11:00 am
DC3502: afa5, b969 / September 21, 2018, 11:00 am
Moreover, due to some technical problems, the initial 6 days for offices 1 and 2 and the initial 21 days for offices 3 and 4 are dummy data and should be ignored. Finally, there were two known outages in DC during the data collection process:
from 00:00 am to 4:00 am on September 17, 2018
from 11:00 pm on October 9, 2018 until 7:45 am on October 10, 2018
We stopped collecting the data around 2:45 pm on May 16, 2019. Therefore, we have 217 uninterrupted days of clean collected data from October 11, 2018 to May 15, 2019. To take care of these problems, we have provided a Python script, process-lighting-data.ipynb, that extracts clean data from the raw data. Both raw and processed data are provided as described next.
Raw data: Raw data folder names correspond to the device names. The light sensing modules log (minute_count, visible_light, IR_light) every minute to a file. Here, minute 0 corresponds to August 30, 2018. Every 1440 minutes (i.e., 1 day) we saved the current file, created a new one, and started writing to it. The filename format is {device_name}_{starting_minute}. For example, Omega-AF65_28800.csv is data collected by Omega-AF65, starting at minute 28800. A metadata file can also be found in each folder with the details of the log file structure.
Processed data: The folder named "processed_data" contains the processed data, which results from running the Python script. Each file in this directory is named after the device ID; for example, af65.csv stores the processed data of the device Omega-AF65. The columns in this file are:
Minutes: consecutive minute of the experiment
Illum: illumination level (lux)
Min_from_midnight: minutes from midnight of the current day
Day_of_exp: count of the day number starting from October 11, 2018
Day_of_year: day of the year
Funding: The Natural Sciences and Engineering Research Council of Canada (NSERC)
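A processed file can be loaded directly with pandas; a minimal sketch using the documented columns (the path inside the zip is an assumption):

import pandas as pd

# Load the processed illumination log for one sensing module (af65).
df = pd.read_csv("processed_data/af65.csv")

# Average available daylight (lux) per day of the experiment,
# using the columns documented above.
daily_mean_lux = df.groupby("Day_of_exp")["Illum"].mean()
print(daily_mean_lux.head())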
The purpose of this data release is to provide data in support of the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. Three comma-separated value files and two Python scripts are included in this data release. It was determined that all reported oil and gas wells within Chaves County from the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were outside of the CFO administration area, so they were excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in the produced_water.csv file to be consistent with the BLM's water support document. Data were synthesized into comma-separated value files: produced_water.csv (volume) from NM OCD, well_records.csv (including location and completion) from NM OCD and FracFocus, and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results in modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Additionally, Python scripts to process, clean, and categorize the FracFocus data are provided in this data release.
Script we use to test the Python ETL update process on milo. Keep it private, but please do not delete.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: The dataset represents a significant effort to compile and clean a comprehensive set of seasonal yield data for sub-Saharan West Africa (Benin, Burkina Faso, Mali, Niger). This dataset, covering more than 22,000 survey answers scattered across more than 2,500 unique locations of smallholder producers' household groups, is instrumental for researchers and policymakers working in agricultural planning and food security in the region. It integrates data from two sources, the LSMS-ISA program (link to the World Bank's site) and the RHoMIS dataset (link to RHoMIS files, RHoMIS' DOI).
The construction of the dataset involved meticulous processing, including converting production into standardized units, calculating yields for each dataset, standardizing column names, assembling the data, and extensive data cleaning, making it a hopefully robust and reliable resource for understanding spatial yield distribution in the region.
Data Sources: The dataset comprises seven spatialized yield data sources, six of which are from the LSMS-ISA program (Mali 2014, Mali 2017, Mali 2018, Benin 2018, Burkina Faso 2018, Niger 2018) and one from the RHoMIS study (only Mali 2017 and Burkina Faso 2018 data selected).
Dataset Preparation Methods: The preparation involved integration of machine-readable files, data cleaning, and finalization using Python/Jupyter Notebook. This process should ensure the accuracy and consistency of the dataset. Yields have been calculated from declared production quantities and GPS-measured plot areas; each yield value corresponds to a single plot.
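A rough sketch of the per-plot yield calculation described above (the column names here are hypothetical; the released notebooks define their own schema):

import pandas as pd

# Hypothetical column names; the released notebooks use their own schema.
plots = pd.DataFrame({
    "production_kg": [850.0, 1200.0],  # declared production quantity
    "plot_area_ha": [0.62, 1.10],      # GPS-measured plot area
})

# One yield value per plot, in kg/ha.
plots["yield_kg_per_ha"] = plots["production_kg"] / plots["plot_area_ha"]
print(plots)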
Discussion: This dataset, with its extensive data compilation, presents an invaluable resource for agricultural productivity-related studies in West Africa. However, users must navigate its complexities, including potential biases due to the survey design and to UML units, as well as data inconsistencies. The dataset's comprehensive nature requires careful handling and validation in research applications.
Authors Contributions:
Data treatment: Eliott Baboz, Jérémy Lavarenne.
Documentation: Jérémy Lavarenne.
Funding: This project was funded by the INTEN-SAHEL TOSCA project (Centre national d’études spatiales). "123456789" was chosen randomly and is not the actual award number because there is none, but it was mandatory to put one here on Zenodo.
Changelog:
v1.0.0 : initial submission
Three Cases: Metadata and Procedures
The data sets described here were used in an article submitted to the journal GeoHealth in 2021. The data files and further supplemental links (including general information about GLOBE data) can be accessed at https://observer.globe.gov/get-data/mosquito-habitat-data.
Case 1: Removal of records with suspect geolocation data. A Python script was applied to remove records where the measured position (in decimal degrees) was identical to the GLOBE MGRS site position. GPS-obtained latitude and longitude coordinates are reported in decimal degrees, so records identified by whole numbers were also removed. This procedure removed 5,704 (23%) of the 24,983 records in the Mosquito Habitat Mapper database, with 19,279 records remaining. The secondary data sets cleaned only for geolocation anomalies were labeled Case 1.
Case 2: Identifying suspected training events. For this test, we sought to identify groups of data that exceeded 10 records sharing these characteristics. Another Python script was employed to extract the photos for ease of visual inspection. Because we needed to manually review the photo records, we set the threshold for groups at >10, so that the analysis could be completed in the time allotted. Groups identified through this procedure were outputted as Case 2: groups. The resulting data set cleaned of groups >10 was labeled Case 2. The resulting data set included 20,006 records and identified 2,447 records found in clusters we postulated were training events.
Case 3: The Case 3 secondary dataset results from applying the Python scripts used to create Cases 1 and 2. We used the Case 3 data sets, with improved geolocation and large groups eliminated, in the following analysis.
The information in this description was last updated 2021-04-12.
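A Case 1-style geolocation filter could be sketched as follows; the file and column names are hypothetical and the released scripts may differ:

import pandas as pd

# Hypothetical file and column names for the Mosquito Habitat Mapper export.
records = pd.read_csv("mosquito_habitat_mapper.csv")

# Case 1: drop records whose measured position equals the MGRS site position,
# and records whose coordinates are whole numbers (suspect geolocation).
same_as_site = (records["measured_lat"] == records["site_lat"]) & \
               (records["measured_lon"] == records["site_lon"])
whole_numbers = (records["measured_lat"] % 1 == 0) & (records["measured_lon"] % 1 == 0)
case1 = records[~(same_as_site | whole_numbers)]
print(len(records), "records before filtering,", len(case1), "after")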
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
The data is licensed through the Creative Commons Attribution 4.0 International.
If you have used our data and are publishing your work, we ask that you please reference both:
this database through its DOI, and
any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.
Included Files
Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data
Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
We recommend that you unzip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.
The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
Clean_Data_v1-0-0.zip: contains all the downsampled data
The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
Database_References_v1-0-0.bib
Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.
File Format: Downsampled Data
These are the "LP_
The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
Time[s]: time in seconds since the start of the test
e_true: true strain
Sigma_true: true stress in MPa
(optional) Temperature[C]: the surface temperature in degC
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
File Format: Unreduced Data
These are the "LP_
The first column is the index of each data point
S/No: sample number recorded by the DAQ
System Date: Date and time of sample
Time[s]: time in seconds since the start of the test
C_1_Force[kN]: load cell force
C_1_Déform1[mm]: extensometer displacement
C_1_Déplacement[mm]: cross-head displacement
Eng_Stress[MPa]: engineering stress
Eng_Strain[]: engineering strain
e_true: true strain
Sigma_true: true stress in MPa
(optional) Temperature[C]: specimen surface temperature in degC
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
hidden_index: internal reference ID
grade: material grade
spec: specifications for the material
source: base material for the test specimen
id: internal name for the specimen
lp: load protocol
size: type of specimen (M8, M12, M20)
gage_length_mm_: unreduced section length in mm
avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
fy_n_mpa_: nominal yield stress
fu_n_mpa_: nominal ultimate stress
t_a_deg_c_: ambient temperature in degC
date: date of test
investigator: person(s) who conducted the test
location: laboratory where test was conducted
machine: setup used to conduct test
pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
citekey: reference corresponding to the Database_References.bib file
yield_stress_mpa_: computed yield stress in MPa
elastic_modulus_mpa_: computed elastic modulus in MPa
fracture_strain: computed average true strain across the fracture surface
c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
file: file name of corresponding clean (downsampled) stress-strain data
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
import pandas as pd
date, version = '2022-08-25_', 'v1-0-0'  # matches the released file names
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')
citekey: reference in "Campaign_References.bib".
Grade: material grade.
Spec.: specifications (e.g., J2+N).
Yield Stress [MPa]: initial yield stress in MPa
size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
Elastic Modulus [MPa]: initial elastic modulus in MPa
size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
Caveats
The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:
A500
A992_Gr50
BCP325
BCR295
HYP400
S460NL
S690QL/25mm
S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
Automatically describing images using natural sentences is an essential task for the inclusion of visually impaired people on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.
PraCegoVer arose on the Internet, stimulating users of social media to publish images, tag them with #PraCegoVer, and add a short description of their content. Inspired by this movement, we have proposed #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.
Dataset Structure
The dataset comprises an images directory containing the images and the file dataset.json, which contains a list of JSON objects with the attributes:
user: anonymized user that made the post;
filename: image file name;
raw_caption: raw caption;
caption: clean caption;
date: post date.
Each instance in dataset.json is associated with exactly one image in the images directory, whose file name is given by the attribute filename. Also, we provide a sample with five instances, so users can download the sample to get an overview of the dataset before downloading it completely.
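A minimal sketch for reading the annotations and pairing each caption with its image file; it assumes the images have been extracted into an images directory as described in the download instructions below:

import json
from pathlib import Path

# Load the annotations and pair each caption with its image file.
with open("dataset.json", encoding="utf-8") as f:
    posts = json.load(f)

for post in posts[:5]:
    image_path = Path("images") / post["filename"]
    print(image_path, "->", post["caption"])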
Download Instructions
If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:
cat images.tar.gz.part* > images.tar.gz tar -xzvf images.tar.gz
Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:
python download_dataset.py --access_token=
Using Machine Learning techniques in general and Deep Learning techniques in particular requires a certain amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust Machine Learning based classification and wear prognostics models, there is a need for real-world datasets to train and test the models on. The dataset contains 1104 channel 3 images with 394 image annotations for the surface damage type "pitting". The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into three folders: data with all images as JPEG, label with all annotations, and saved_model with a baseline model. The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits images into the same train and test data split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types. One of the two mentioned BSD types is represented with 69 images and 55 different image sizes. All images with this BSD type come either in a clean or soiled condition. The other BSD type is shown in 325 images with two image sizes. Since all images of this type have been taken continuously over time, the degree of soiling evolves. Also, as mentioned above, the dataset contains 27 pitting development sequences with 69 images each.
Instruction: dataset split
The authors of this dataset provide 3 types of dataset splits. To get a data split, run the Python script split_dataset.py.
Script inputs:
split-type (mandatory)
output directory (mandatory)
Different split types:
train_test_split: splits the dataset into train and test data (80%/20%)
wear_dev_split: splits the dataset into 27 wear developments
type_split: splits the dataset into the different BSD types
Example:
C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
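An annotation file can be inspected with a few lines of Python; this sketch assumes the standard labelme JSON layout (a "shapes" list with "label" and "points" entries), and the file name is hypothetical:

import json

# Read one labelme annotation file (file name is hypothetical).
with open("label/example_image.json", encoding="utf-8") as f:
    annotation = json.load(f)

# Standard labelme layout: a "shapes" list with a label and polygon points.
for shape in annotation["shapes"]:
    if shape["label"] == "pitting":
        print(shape["label"], len(shape["points"]), "polygon points")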
Abstract copyright UK Data Service and data collection copyright owner.
The heat pump monitoring datasets are a key output of the Electrification of Heat Demonstration (EoH) project, a government-funded heat pump trial assessing the feasibility of heat pumps across the UK’s diverse housing stock. These datasets are provided in both cleansed and raw form and allow analysis of the initial performance of the heat pumps installed in the trial. From the datasets, insights such as heat pump seasonal performance factor (a measure of the heat pump's efficiency), heat pump performance during the coldest day of the year, and half-hourly performance to inform peak demand can be gleaned.
For the second edition (December 2024), the data were updated to include cleaned performance data collected between November 2020 and September 2023. The only documentation currently available with the study is the Excel data dictionary. Reports and other contextual information can be found on the Energy Systems Catapult website.
The EoH project was funded by the Department of Business, Energy and Industrial Strategy. From 2023, it is covered by the new Department for Energy Security and Net Zero.
Data availability
This study comprises the open-access cleansed data from the EoH project and a summary dataset, available in four zipped files (see the 'Access Data' tab). Users must download all four zip files to obtain the full set of cleansed data and accompanying documentation.
When unzipped, the full cleansed data comprises 742 CSV files. Most of the individual CSV files are too large to open in Excel. Users should ensure they have sufficient computing facilities to analyse the data.
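One way to stay within memory limits is to stream each CSV in chunks with pandas; a minimal sketch (the file name is hypothetical; see the data dictionary for the actual columns):

import pandas as pd

# Stream one large cleansed CSV in chunks instead of loading it all at once
# (file name is hypothetical; see the data dictionary for the real columns).
row_count = 0
for chunk in pd.read_csv("cleansed_data/property_0001.csv", chunksize=500_000):
    row_count += len(chunk)
print(row_count, "rows processed")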
The UKDS also holds an accompanying study, SN 9049 Electrification of Heat Demonstration Project: Heat Pump Performance Raw Data, 2020-2023, which is available only to registered UKDS users. This contains the raw data from the EoH project. Since the data are very large, only the summary dataset is available to download; an order must be placed for FTP delivery of the remaining raw data. Other studies in the set include SN 9209, which comprises 30-minute interval heat pump performance data, and SN 9210, which includes daily heat pump performance data.
The Python code used to cleanse the raw data and then perform the analysis is accessible via the Energy Systems Catapult GitHub.
Heat Pump Performance across the BEIS funded heat pump trial, The Electrification of Heat (EoH) Demonstration Project. See the documentation for data contents.
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikipedia', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The magnetotelluric (MT) method is increasingly being applied to a wide variety of geoscience problems. However, the software available for MT data analysis and interpretation is still very limited in comparison to many of the more mature geophysical methods such as the gravity, magnetic or seismic reflection methods. MTPy is an open source Python package to assist with MT data processing, analysis, modelling, visualization and interpretation. It was initiated at the University of Adelaide in 2013 as a means to store and share Python code amongst the MT community (Krieger and Peacock 2014). Here we provide an overview of the software and describe recent developments to MTPy. These include new functionality and a clean up and standardisation of the source code, as well as the addition of an integrated testing suite, documentation, and examples in order to facilitate the use of MT in the wider geophysics community.
description: The Burmese python (Python molurus bivittatus), a native to Southeast Asia, can reach a length greater than twenty feet (Wall 1921, Pope 1961). This python is a long lived (15 - 25 years) behavioral, habitat, and dietary generalist, capable of producing large clutches of eggs (8 - 107) (Lederer 1956, Branch and Erasmus 1984). Observations of Burmese pythons exist in the United States primarily from locations within Everglades National Park (ENP), including; along the Main Park Road in the saline and freshwater glades, and mangroves, between Pay-hay-okee and Flamingo, the greater Long Pine Key area (including Hole-in-the-Donut), and the greater Shark Valley area along the Tamiami Trail (including L-67 Ext.). The non-native species has also been observed repeatedly on the eastern boundary of ENP, along canal levees, in the remote mangrove backcountry, and in Big Cypress National Preserve. From 2002 (when the numbers first began to climb) to 2005, 201 pythons were captured and removed or found dead. In 2006-2007 alone, that number more than doubled to 418. Measured total length for snakes recovered ranged from 0.5 m to 4.5 m including five hatchling-sized animals recovered in the summer of 2004, and two hatchlings captured in 2005. In 2008, 343 pythons were removed, and so far in 2009, 347 individuals have been removed. The non-native semi-aquatic pythons's diet in southern Florida includes raccoon, rabbit, muskrat, squirrel, opossum, cotton rat, black rat, bobcat, house wren, pied-billed grebe, white ibis, limpkin, alligator and endangered Key Largo wood rat. As Python molurus is known to eat birds, and also known to frequent wading bird colonies in their native range, the proximity of python sightings to the Paurotis Pond and Tamiami West wood stork rookeries is troubling. The potential for pythons to eat Mangrove Fox Squirrels and Cape Sable Seaside Sparrows and to compete with Indigos Snakes is also of concern. Burmese Pythons present a potential threat to successful ecological restoration of the greater Everglades (NRC 2005). Pythons are now established and breeding in South Florida. Python molurus bivittatus has the potential to occupy the entire footprint of the Comprehensive Everglades Restoration Project (CERP), adversely impacting valued resources across the landscape. Proposed management and control actions must include research strategies and further evaluation of potential impacts of pythons. The results of this project will be applied to develop a comprehensive, science-based control and containment program. The proposed project will also increase our understanding of the impacts of Burmese pythons on native fauna in DOI and surrounding lands. Dealing with established exotic species requires that we understand their status and impacts, and how to remove them. A current priority item for determining status is finding out the extent of invasion by established species. Once we know where the threat is occurring, we need a better understanding of how the threat may manifest itself ecologically-that is, what are the impacts of invasion? We can hypothesize that Burmese pythons compete with native snakes or affect populations of prey species; however, knowing with certainty that pythons eat wood rats, for example, better focuses eradication efforts and spurs action. A study of diet of Burmese pythons directly addresses this issue. 
Further, knowing how much pythons eat through a bioenergetic model allows us to forecast with more certainty predation impacts on native fauna.
This point layer contains monthly summaries of daily temperatures (means, minimums, and maximums) and precipitation levels (sum, lowest, and highest) for the period January 1981 through December 2010 for weather stations in the Global Historical Climate Network Daily (GHCND). Data in this service were obtained from web services hosted by the Applied Climate Information System (ACIS). ACIS staff curate the values for the U.S., including correcting erroneous values, reconciling data from stations that have been moved over their history, etc. The data were compiled at Esri from publicly available sources hosted and administered by NOAA. Because the ACIS data is updated and corrected on an ongoing basis, the date of collection for this layer was Jan 23, 2019.
The following process was used to produce this dataset:
1. Download the most current list of stations from ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt. Import this into Microsoft Excel and save as CSV. In ArcGIS, import the CSV as a geodatabase table and use the XY Event layer tool to locate each point. Using a detailed U.S. boundary, extract the points that fall within the 50 U.S. states, the District of Columbia, and Puerto Rico.
2. Using Python with DA.UpdateCursor and urllib2, access the ACIS Web Services API to determine whether each station had at least 50 monthly values of temperature data. Delete the other stations. Using Python, add the necessary field names and acquire all monthly values for the remaining stations. Thus, there are stations that have some missing data. Using Python, add fields and convert the standard values to metric values so both would be present.
3. Thus, there are four sets of monthly data in this dataset:
a. Monthly means, mins, and maxes of daily temperatures - degrees Fahrenheit.
b. Monthly mean of monthly sums of precipitation and the level of precipitation that was the minimum and maximum during the period 1981 to 2010 - mm.
c. Temperatures in 3a in degrees Celsius.
d. Precipitation levels in 3b in inches.
4. After initially publishing these data in a different service, it was learned that more precise coordinates for station locations were available from the Enhanced Master Station History Report (EMSHR) published by NOAA NCDC. With the publication of this layer, these most precise coordinates are used. A large subset of the EMSHR metadata is available via EMSHR Stations Locations and Metadata 1738 to Present.
If your study area includes areas outside of the U.S., use the World Historical Climate - Monthly Averages for GHCN-D Stations 1981 - 2010 layer. The data in this layer come from the same source archive; however, they are not curated by the ACIS staff and may contain errors.
Revision History:
Initially Published: 23 Jan 2019
Updated 16 Apr 2019 - We learned more precise coordinates for station locations were available from the Enhanced Master Station History Report (EMSHR) published by NOAA NCDC. With the publication of this layer, the geometry and attributes for 3,222 of 9,636 stations now have more precise coordinates. The schema was updated to include the NCDC station identifier, and elevation fields for feet and meters are also included. A large subset of the EMSHR data is available via EMSHR Stations Locations and Metadata 1738 to Present.
Cite as: Esri, 2019: U.S. Historical Climate - Monthly Averages for GHCN-D Stations for 1981 - 2010. ArcGIS Online, Accessed
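As a minimal sketch of the first step, the station list can also be parsed directly in Python instead of via Excel; the fixed-width column positions below follow the GHCN-Daily documentation and should be verified against the current readme:

import pandas as pd

# Read the GHCN-D station list (fixed-width file). Column positions follow the
# GHCN-Daily readme; verify them against the current documentation.
colspecs = [(0, 11), (12, 20), (21, 30), (31, 37), (38, 40), (41, 71)]
names = ["station_id", "latitude", "longitude", "elevation", "state", "name"]
stations = pd.read_fwf("ghcnd-stations.txt", colspecs=colspecs, names=names, header=None)
print(stations.head())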
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
-----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as a current directory.

- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
  Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
------------------------------

The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `