License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This dataset compares fixed-line broadband internet speeds across five cities:
- Melbourne, AU
- Bangkok, TH
- Shanghai, CN
- Los Angeles, US
- Alice Springs, AU
ERRATA: 1. Data is for Q3 2020, but some files were labelled incorrectly as 02-20 or June 20. They should all read Sept 20 (09-20), i.e. Q3 20, rather than Q2. Will rename and reload; amended in v7.
* Lines of data for each geojson file; a line equates to a 600m^2 location and includes total tests, devices used, and average upload and download speed:
- MEL: 16181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
- SHG: 31745 lines => 0.65M speedtests (2.5/100pp)
- BKK: 29296 lines => 1.5M speedtests (14.3/100pp)
- LAX: 15899 lines => 1.3M speedtests (10.4/100pp)
- ALC: 76 lines => 500 speedtests (2/100pp)
GeoJSONs of these 2-degree by 2-degree extracts for MEL, BKK and SHG are now added; LAX was added in v6 and Alice Springs in v15.
This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.
** To Do
Will add Google Map versions so everyone can see the data without installing Jupyter.
- Link to Google Map (BKK) added below. Key: green > 100 Mbps (Superfast); black > 500 Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook.
- Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melbourne has 20% of locations at or above 100 Mbps. Suggest plotting the Top 20% on a map for the community. Google Map link now added (and tweet).
** Python
melb = au_tiles.cx[144:146, -39:-37]  # Lat/Lon extract
shg = tiles.cx[120:122, 30:32]        # Lat/Lon extract
bkk = tiles.cx[100:102, 13:15]        # Lat/Lon extract
lax = tiles.cx[-118:-120, 33:35]      # Lat/Lon extract
alc = tiles.cx[132:134, -22:-24]      # Lat/Lon extract
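For context, a minimal sketch (not part of the dataset itself) of how such an extract could be reproduced with geopandas from a local copy of the Ookla fixed-broadband tiles; the file name and the avg_d_kbps/tests column names are assumptions:

```python
import geopandas as gpd

# Load a quarterly Ookla fixed-broadband tile file (illustrative file name).
tiles = gpd.read_file("gps_fixed_tiles_2020_q3.geojson")

# .cx slices a GeoDataFrame by bounding box: [min_lon:max_lon, min_lat:max_lat].
mel = tiles.cx[144:146, -39:-37]   # Melbourne
bkk = tiles.cx[100:102, 13:15]     # Bangkok

# Average download speed weighted by test count (column names are assumptions).
avg_mbps = (mel["avg_d_kbps"] * mel["tests"]).sum() / mel["tests"].sum() / 1000
print(f"MEL mean download: {avg_mbps:.1f} Mbps over {mel['tests'].sum()} tests")
```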
Histograms (v9) and data visualisations (v3, 5, 9, 11) will be provided. Data sourced from: this is an extract of Speedtest Open Data available on Amazon AWS (link below - opendata.aws).
** VERSIONS
- v24: Add tweet and Google Map of Top 20% (over 100 Mbps locations) in MEL Q3 22. Add v1.5 MEL-Superfast notebook and CSV of results (now on Google Map; link below).
- v23: Add graph of 2022 broadband distribution and compare 2020 vs 2022. Updated v1.4 Jupyter notebook.
- v22: Add import ipynb; workflow-import-4cities.
- v21: Add Q3 2022 data; five cities inc ALC. GeoJSON files. (2020: 4.3M tests; 2022: 2.9M tests)
- v20: Speedtest - Five Cities inc ALC.
- v19: Add ALC2.ipynb.
- v18: Add ALC line graph.
- v17: Added ipynb for ALC. Added ALC to title.
- v16: Load Alice Springs data Q2 21 - CSV. Added Google Map link for ALC.
- v15: Load Melb Q1 2021 data - CSV.
- v14: Added Melb Q1 2021 data - GeoJSON.
- v13: Added Twitter link to pics.
- v12: Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb).
- v11: Add Line-Compare pic, plotting four cities on a graph.
- v10: Add four histograms in one pic.
- v9: Add histogram for four cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook).
- v8: Renamed LAX file to Q3, rather than 03.
- v7: Amended file names of BKK files to correctly label as Q3, not Q2 or 06.
- v6: Added LAX file.
- v5: Add screenshot of BKK Google Map.
- v4: Add BKK Google Map (link below) and BKK CSV mapping files.
- v3: Replaced MEL map with big-key version; previous key was very tiny in top right corner.
- v2: Uploaded MEL, SHG, BKK data and Jupyter Notebook.
- v1: Metadata record.
** LICENCE The AWS data licence on the Speedtest data is CC BY-NC-SA 4.0, so use of this data must be: - non-commercial (NC) - share-alike (SA) (reuse must carry the same licence). This restricts the standard CC BY Figshare licence.
** Other uses of Speedtest Open Data: see link at Speedtest below.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This JavaScript code has been developed to retrieve NDSI_Snow_Cover from MODIS version 6 for SNOTEL sites using the Google Earth Engine platform. To successfully run the code, you need a Google Earth Engine account. An input file, NWM_grid_Western_US_polygons_SNOTEL_ID.zip, is required to run the code. This input file includes the 1 km grid cells of the NWM that contain SNOTEL sites. You need to upload this input file to the Assets tab in the Google Earth Engine code editor. You also need to import the MOD10A1.006 Terra Snow Cover Daily Global 500m collection into the Google Earth Engine code editor. You can do this by searching for the product name in the search bar of the code editor.
The JavaScript works for a specified time range. We found that the best period is one month, which is the maximum allowable time range for running the computation for all SNOTEL sites on Google Earth Engine. The script consists of two main loops. The first loop retrieves data from the first day of a month up to day 28, in five periods. The second loop retrieves data from day 28 to the beginning of the next month. The results are shown as graphs on the right-hand side of the Google Earth Engine code editor under the Console tab. To save results as CSV files, open each time series by clicking the button located at each graph's top right corner. From the new web page, you can click the Download CSV button at the top.
Here is the link to the script path: https://code.earthengine.google.com/?scriptPath=users%2Figarousi%2Fppr2-modis%3AMODIS-monthly
Then, run the Jupyter Notebook (merge_downloaded_csv_files.ipynb) to merge the downloaded CSV files, stored for example in a folder called output/from_GEE, into a single CSV file, merged.csv. The notebook then applies some preprocessing steps; the final output is NDSI_FSCA_MODIS_C6.csv.
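A minimal sketch of the merging step described above, assuming the files exported from Google Earth Engine sit in output/from_GEE; the subsequent preprocessing that produces NDSI_FSCA_MODIS_C6.csv is left to the notebook itself:

```python
import glob
import os

import pandas as pd

# Collect all monthly CSV files exported from Google Earth Engine.
csv_files = sorted(glob.glob(os.path.join("output", "from_GEE", "*.csv")))

# Concatenate them into a single table and write merged.csv.
merged = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
merged.to_csv("merged.csv", index=False)
print(f"Merged {len(csv_files)} files into merged.csv ({len(merged)} rows)")
```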
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/ (license information derived automatically)
This publication contains several datasets that have been used in the paper "Crowdsourcing open citations with CROCI – An analysis of the current status of open citations, and a proposal" submitted to the 17th International Conference on Scientometrics and Bibliometrics (ISSI 2019), available at https://opencitations.wordpress.com/2019/02/07/crowdsourcing-open-citations-with-croci/.
Additional information about the analyses described in the paper, including the code and the data we have used to compute all the figures, is available as a Jupyter notebook at https://github.com/sosgang/pushing-open-citations-issi2019/blob/master/script/croci_nb.ipynb. The datasets contain the following information.
non_open.zip: it is a zipped (~5 GB unzipped) CSV file containing the numbers of open citations and closed citations received by the entities in the Crossref dump used in our computation, dated October 2018. All the entity types retrieved from Crossref were aligned to one of the following five categories: journal, book, proceedings, dataset, other. The open CC0 citation data we used came from the CSV dump of the most recent release of COCI, dated 12 November 2018. The number of closed citations was calculated by subtracting the number of open citations to each entity available within COCI from the value “is-referenced-by-count” available in the Crossref metadata for that particular cited entity, which reports all the DOI-to-DOI citation links that point to the cited entity from within the whole Crossref database (including those present in the Crossref ‘closed’ dataset).
The columns of the CSV file are the following ones:
doi: the DOI of the publication in Crossref;
type: the type of the publication as indicated in Crossref;
cited_by: the number of open citations received by the publication according to COCI;
non_open: the number of closed citations received by the publication according to Crossref + COCI.
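To illustrate how these columns can be combined, here is a hedged sketch that computes the share of open citations per record and averages it by entity type; the unzipped file name non_open.csv is an assumption:

```python
import pandas as pd

# Load the unzipped citation counts (file name is an assumption).
df = pd.read_csv("non_open.csv")

# Fraction of each entity's citations that are open (cited_by) versus closed (non_open).
total = df["cited_by"] + df["non_open"]
df["open_share"] = df["cited_by"] / total.where(total > 0)

print(df.groupby("type")["open_share"].mean().sort_values(ascending=False))
```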
croci_types.csv: it is a CSV file that contains the numbers of open citations and closed citations received by the entities in the Crossref dump used in our computation, as collected in the previous CSV file, aligned in five classes depending on the entity types retrieved from Crossref: journal (Crossref types: journal-article, journal-issue, journal-volume, journal), book (Crossref types: book, book-chapter, book-section, monograph, book track, book-part, book-set, reference-book, dissertation, book series, edited book), proceedings (Crossref types: proceedings-article, proceedings, proceedings-series), dataset (Crossref types: dataset), other (Crossref types: other, report, peer review, reference-entry, component, report-series, standard, posted-content, standard-series).
The columns of the CSV file are the following ones:
type: the type of publication, one of "journal", "book", "proceedings", "dataset", "other";
label: the label assigned to the type for visualisation purposes;
coci_open_cit: the number of open citations received by the publication type according to COCI;
crossref_close_cit: the number of closed citations received by the publication type according to Crossref + COCI.
publishers_cits.csv: it is a CSV file that contains the top twenty publishers that received the greatest number of open citations. The columns of the CSV file are the following ones:
publisher: the name of the publisher;
doi_prefix: the list of DOI prefixes assigned to the publisher;
coci_open_cit: the number of open citations received by the publications of the publisher according to COCI;
crossref_close_cit: the number of closed citations received by the publications of the publishers according to Crossref + COCI;
total_cit: the total number of citations received by the publications of the publisher (= coci_open_cit + crossref_close_cit).
20publishers_cr.csv: it is a CSV file that contains the numbers of the contributions to open citations made by the twenty publishers introduced in the previous CSV file as of 24 January 2018, according to the data available through the Crossref API. The counts listed in this file refer to the number of publications for which each publisher has submitted metadata to Crossref that includes the publication’s reference list. The categories 'closed', 'limited' and 'open' refer to publications whose reference lists are not visible to anyone outside the Crossref Cited-by membership, are visible only to them and to Crossref Metadata Plus members, or are visible to all, respectively. In addition, the file also records the total number of publications for which the publisher has submitted metadata to Crossref, whether or not those metadata include the reference lists of those publications.
The columns of the CSV file are the following ones:
publisher: the name of the publisher;
open: the number of publications in Crossref with 'open' visibility for their reference lists;
limited: the number of publications in Crossref with 'limited' visibility for their reference lists;
closed: the number of publications in Crossref with 'closed' visibility for their reference lists;
overall_deposited: the overall number of publications for which the publisher has submitted metadata to Crossref.
Objective: Daily COVID-19 data reported by the World Health Organization (WHO) may provide the basis for political ad hoc decisions including travel restrictions. Data reported by countries, however, are heterogeneous and metrics to evaluate their quality are scarce. In this work, we analyzed COVID-19 case counts provided by WHO and developed tools to evaluate country-specific reporting behaviors.
Methods: In this retrospective cross-sectional study, COVID-19 data reported daily to WHO from 3rd January 2020 until 14th June 2021 were analyzed. We proposed the concepts of binary reporting rate and relative reporting behavior and performed descriptive analyses for all countries with these metrics. We developed a score to evaluate the consistency of incidence and binary reporting rates. Further, we performed spectral clustering of the binary reporting rate and relative reporting behavior to identify salient patterns in these metrics.
Results: Our final analysis included 222 countries and regions...
Data collection: COVID-19 data were downloaded from WHO. Using a public repository, we added the countries' full names to the WHO data set, using the two-letter abbreviations for each country to merge both data sets. The provided COVID-19 data cover January 2020 until June 2021. We uploaded the final data set used for the analyses of this paper.
Data processing: We processed data using a Jupyter Notebook with a Python kernel and publicly available external libraries. This upload contains the required Jupyter Notebook (reporting_behavior.ipynb) with all analyses and some additional work, a README, and the conda environment yml (env.yml).
Any text editor, including Microsoft Excel and its free alternatives, can open the uploaded CSV file. Any web browser and some code editors (such as the freely available Visual Studio Code) can show the uploaded Jupyter Notebook if the required Python environment is set up correctly.
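As a rough illustration of the country-name merge described above (not the authors' actual code), the following sketch joins the WHO counts with a lookup table on the two-letter country codes; file and column names are assumptions:

```python
import pandas as pd

# WHO daily case counts and a country-code lookup table; file and column names are
# illustrative, not necessarily the exact ones used in this upload.
who = pd.read_csv("WHO-COVID-19-global-data.csv")
names = pd.read_csv("country_codes.csv")  # assumed columns: alpha2, full_name

# Attach full country names via the two-letter country code, then save the merged table.
merged = who.merge(names, left_on="Country_code", right_on="alpha2", how="left")
merged.to_csv("who_with_country_names.csv", index=False)
```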
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This S&M-HSTPM2d5 dataset contains high spatial and temporal resolution particulate matter (PM2.5) measurements, with the corresponding timestamp and GPS location of mobile and static devices, in three Chinese cities: Foshan, Cangzhou, and Tianjin. Different numbers of static and mobile devices were set up in each city. The sampling rate was one minute in Cangzhou and three seconds in Foshan and Tianjin. For specific details of the setup, please refer to the Device_Setup_Description.txt file in this repository and the data descriptor paper.
After the data collection process, a data cleaning process was performed to remove and adjust abnormal and drifting data. The script of the data cleaning algorithm is provided in this repository. The data cleaning algorithm only adjusts or removes individual data points. Removal of an entire device's data was done after the data cleaning algorithm, based on empirical judgment and graphic visualization. For specific details of the data cleaning process, please refer to the script (Data_cleaning_algorithm.ipynb) in this repository and the data descriptor paper.
The dataset in this repository is the processed version. The raw dataset and removed devices are not included in this repository.
The data are stored as CSV files. Each CSV file, named by the device ID, contains the data collected by the corresponding device. Each CSV file has three types of data: timestamp in China Standard Time (GMT+8), geographic location as latitude and longitude, and PM2.5 concentration in micrograms per cubic metre. The CSV files are stored in either a Static or a Mobile folder, which represents the device type. The Static and Mobile folders are stored in the corresponding city's folder.
To access the dataset, any programming language that can read CSV files is appropriate. Users can also open the CSV files directly. The get_dataset.ipynb file in this repository also provides an option for accessing the dataset. To execute the .ipynb files, Jupyter Notebook with Python 3 is required, along with the following Python libraries:
get_dataset.ipynb: 1. os library 2. pandas library
Data_cleaning_algorithm.ipynb: 1. os library 2. pandas library 3. datetime library 4. math library
Instructions for installing the libraries above can be found online. After installing Jupyter Notebook with Python 3 and the required libraries, users can open the .ipynb files with Jupyter Notebook and follow the instructions inside each file.
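The following is a hedged sketch, in the spirit of get_dataset.ipynb and using only the os and pandas libraries listed above, of loading every device CSV for one city and device type; the exact folder and file naming is assumed from the description:

```python
import os

import pandas as pd

def load_city(city_dir, device_type):
    """Load every device CSV under <city_dir>/<device_type> into one DataFrame."""
    folder = os.path.join(city_dir, device_type)  # e.g. "Foshan/Static" (assumed layout)
    frames = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".csv"):
            df = pd.read_csv(os.path.join(folder, name))
            df["device_id"] = os.path.splitext(name)[0]  # file name encodes the device ID
            frames.append(df)
    return pd.concat(frames, ignore_index=True)

static_foshan = load_city("Foshan", "Static")
print(static_foshan.head())
```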
For questions or suggestions please e-mail Xinlei Chen
List of renewable energy power stations. This Data Package contains lists of renewable energy-based power plants of Germany, Denmark, France, Switzerland, the United Kingdom and Poland. Germany: more than 1.7 million renewable power plant entries, eligible under the renewable support scheme (EEG). Denmark: wind and photovoltaic power plants with a high level of detail. France: aggregated capacity and number of installations per energy source per municipality (Commune). Poland: summed capacity and number of installations per energy source per municipality (Powiat). Switzerland: renewable power plants eligible under the Swiss feed-in tariff KEV (Kostendeckende Einspeisevergütung). United Kingdom: renewable power plants in the United Kingdom. Due to different data availability, the power plant lists are of different accuracy and partly provide different power plant parameters. Because of that, the lists are provided as separate CSV files per country and as separate sheets in the Excel file. Suspect data or entries with a high probability of duplication are marked in the column 'comment'. These validation markers are explained in the file validation_marker.csv. Additionally, the Data Package includes daily time series of cumulated installed capacity per energy source type for Germany, Denmark, Switzerland and the United Kingdom. All data processing is conducted in Python and pandas and has been documented in the Jupyter Notebooks linked below.
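As a small, hedged example of working with these lists (not part of the Data Package itself), the sketch below reads one per-country CSV and counts entries flagged in the 'comment' column; the file names are assumptions based on the description:

```python
import pandas as pd

# One per-country plant list plus the marker legend (file names are assumptions).
plants = pd.read_csv("renewable_power_plants_DE.csv")
markers = pd.read_csv("validation_marker.csv")

# Suspect entries and probable duplicates carry a validation marker in the 'comment' column.
flagged = plants[plants["comment"].notna()]
print(f"{len(flagged)} of {len(plants)} entries carry a validation marker")
print(markers.head())
```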
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Data and code needed to reproduce the results of the paper "Effects of community management on user activity in online communities", available in draft here.
Instructions:
Please note: I use both Stata and Jupyter Notebook interactively, running a block with a few lines of code at a time. Expect to have to change directories, file names etc.
This dataset contains data and code from the manuscript: Heintzman, L.J., McIntyre, N.E., Langendoen, E.J., & Read, Q.D. (2024). Cultivation and dynamic cropping processes impart land-cover heterogeneity within agroecosystems: a metrics-based case study in the Yazoo-Mississippi Delta (USA). Landscape Ecology 39, 29 (2024). https://doi.org/10.1007/s10980-024-01797-0
There are 14 rasters of land use and land cover data for the study region, in .tif format with associated auxiliary files, two shapefiles with county boundaries and study area extent, a CSV file with summary information derived from the rasters, and a Jupyter notebook containing Python code. The rasters included here represent an intermediate data product. Original unprocessed rasters from NASS CropScape are not included here, nor is the code to process them.
List of files:
- MS_Delta_maps.zip
  - MSDeltaCounties_UTMZone15N.shp: depiction of the 19 counties (labeled) that intersect the Mississippi Alluvial Plain in western Mississippi.
  - MS_Delta_MAP_UTMZone15N.shp: depiction of the study area extent.
- mf8h_20082021.zip
  - mf8h_XXXX.tif: yearly, reclassified and majority-filtered LULC data used to build comboall1.csv, derived from USDA NASS CropScape. There are 14 .tif files total for years 2008-2021. Each .tif file includes auxiliary files with the same file name and the following extensions: .tfw, .tif.aux.xml, .tif.ovr, .tif.vat.cpg, .tif.vat.dbf.
- comboall1.csv: combined dataset of LULC information for all 14 years in the study period.
- analysis.ipynb_.txt: Jupyter Notebook used to analyze comboall1.csv. Convert to .ipynb format to open with Jupyter.
This research was conducted under USDA Agricultural Research Service, National Program 211 (Water Availability and Watershed Management).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This resource contains a sample park analysis notebook and the Hope_Park_original.csv file.
## Contents
- sample park analysis.ipynb: The main analysis notebook (Colab/Jupyter format)
- Hope_Park_original.csv: Source dataset containing park information
- README.md: Documentation for the contents and usage
## Usage
1. Open the notebook in Google Colab or Jupyter.
2. Upload the Hope_Park_original.csv file to the working directory (or adjust the file path in the notebook).
3. Run each cell sequentially to reproduce the analysis.
## Requirements
The notebook uses standard Python data science libraries:
```python
pandas
numpy
matplotlib
seaborn
```
T1DiabetesGranada
A longitudinal multi-modal dataset of type 1 diabetes mellitus
Documented by:
Rodriguez-Leon, C., Aviles-Perez, M. D., Banos, O., Quesada-Charneco, M., Lopez-Ibarra, P. J., Villalonga, C., & Munoz-Torres, M. (2023). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus. Scientific Data, 10(1), 916. https://doi.org/10.1038/s41597-023-02737-4
Background
Type 1 diabetes mellitus (T1D) patients face daily difficulties in keeping their blood glucose levels within appropriate ranges. Several techniques and devices, such as flash glucose meters, have been developed to help T1D patients improve their quality of life. Most recently, the data collected via these devices is being used to train advanced artificial intelligence models to characterize the evolution of the disease and support its management. The main problem for the generation of these models is the scarcity of data, as most published works use private or artificially generated datasets. For this reason, this work presents T1DiabetesGranada, an open (access granted under specific permission) longitudinal dataset that not only provides continuous glucose levels, but also patient demographic and clinical information. The dataset includes 257780 days of measurements over four years from 736 T1D patients from the province of Granada, Spain. This dataset progresses significantly beyond the state of the art as one of the longest and largest open datasets of continuous glucose measurements, thus boosting the development of new artificial intelligence models for glucose level characterization and prediction.
Data Records
The data are stored in four comma-separated values (CSV) files which are available in T1DiabetesGranada.zip. These files are described in detail below.
Patient_info.csv
Patient_info.csv is the file containing information about the patients, such as demographic data, start and end dates of blood glucose level measurements and biochemical parameters, number of biochemical parameters or number of diagnostics. This file is composed of 736 records, one for each patient in the dataset, and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Sex – Sex of the patient. Values: F (female), M (male).
Birth_year – Year of birth of the patient. Format: YYYY.
Initial_measurement_date – Date of the first blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Final_measurement_date – Date of the last blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Number_of_days_with_measures – Number of days with blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 8 to 1463.
Number_of_measurements – Number of blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 400 to 137292.
Initial_biochemical_parameters_date – Date of the first biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Final_biochemical_parameters_date – Date of the last biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Number_of_biochemical_parameters – Number of biochemical parameters measured on the patient, extracted from the Biochemical_parameters.csv file. Values: ranging from 4 to 846.
Number_of_diagnostics – Number of diagnoses recorded for the patient, extracted from the Diagnostics.csv file. Values: ranging from 1 to 24.
Glucose_measurements.csv
Glucose_measurements.csv is the file containing the continuous blood glucose level measurements of the patients. The file is composed of more than 22.6 million records that constitute the time series of continuous blood glucose level measurements. It includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Measurement_date – Date of the blood glucose level measurement. Format: YYYY-MM-DD.
Measurement_time – Time of the blood glucose level measurement. Format: HH:MM:SS.
Measurement – Value of the blood glucose level measurement in mg/dL. Values: ranging from 40 to 500.
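A hedged sketch of loading this file and checking the measurement cadence per patient; the column names come from the description above, everything else is illustrative:

```python
import pandas as pd

glucose = pd.read_csv("Glucose_measurements.csv")  # ~22.6 million rows; needs several GB of RAM

# Combine the date and time columns into a single timestamp and sort per patient.
glucose["timestamp"] = pd.to_datetime(
    glucose["Measurement_date"] + " " + glucose["Measurement_time"]
)
glucose = glucose.sort_values(["Patient_ID", "timestamp"])

# Median time between consecutive measurements per patient (nominally around 15 minutes).
gaps = glucose.groupby("Patient_ID")["timestamp"].diff()
print(gaps.groupby(glucose["Patient_ID"]).median().describe())
```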
Biochemical_parameters.csv
Biochemical_parameters.csv is the file containing data of the biochemical tests performed on patients to measure their biochemical parameters. This file is composed of 87482 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Reception_date – Date of receipt in the laboratory of the sample to measure the biochemical parameter. Format: YYYY-MM-DD.
Name – Name of the measured biochemical parameter. Values: 'Potassium', 'HDL cholesterol', 'Gammaglutamyl Transferase (GGT)', 'Creatinine', 'Glucose', 'Uric acid', 'Triglycerides', 'Alanine transaminase (GPT)', 'Chlorine', 'Thyrotropin (TSH)', 'Sodium', 'Glycated hemoglobin (Ac)', 'Total cholesterol', 'Albumin (urine)', 'Creatinine (urine)', 'Insulin', 'IA ANTIBODIES'.
Value – Value of the biochemical parameter. Values: ranging from -4.0 to 6446.74.
Diagnostics.csv
Diagnostics.csv is the file containing diagnoses of diabetes mellitus complications or other diseases that patients have in addition to type 1 diabetes mellitus. This file is composed of 1757 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Code – ICD-9-CM diagnosis code. Values: subset of 594 of the ICD-9-CM codes (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Description – ICD-9-CM long description. Values: subset of 594 of the ICD-9-CM long description (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Technical Validation
Blood glucose level measurements are collected using FreeStyle Libre devices, which are widely used for healthcare in patients with T1D. Abbott Diabetes Care, Inc., Alameda, CA, USA, the manufacturer, has conducted validation studies of these devices, concluding that the measurements made by their sensors are comparable to those of YSI analyzer devices (Xylem Inc.), the gold standard, with 99.9% of results falling within zones A and B of the consensus error grid. In addition, other studies external to the company concluded that the accuracy of the measurements is adequate.
Moreover, it was also checked that, in most cases, the blood glucose level measurements per patient were continuous (i.e. a sample at least every 15 minutes) in the Glucose_measurements.csv file, as they should be.
Usage Notes
For data downloading, it is necessary to be authenticated on the Zenodo platform, accept the Data Usage Agreement and send a request specifying full name, email, and the justification of the data use. This request will be processed by the Secretary of the Department of Computer Engineering, Automatics, and Robotics of the University of Granada and access to the dataset will be granted.
The files that compose the dataset are comma-delimited CSV files available in T1DiabetesGranada.zip. A Jupyter Notebook (Python v. 3.8) with code that may help to better understand the dataset, with graphics and statistics, is available in UsageNotes.zip.
Graphs_and_stats.ipynb
The Jupyter Notebook generates tables, graphs and statistics for a better understanding of the dataset. It has four main sections, one dedicated to each file in the dataset. In addition, it provides useful functions such as calculating a patient's age, removing a list of patients from a dataset file, and keeping only a given list of patients in a dataset file.
Code Availability
The dataset was generated using custom code located in CodeAvailability.zip. The code is provided as Jupyter Notebooks created with Python v. 3.8. The code was used for tasks such as data curation and transformation, and variable extraction.
Original_patient_info_curation.ipynb
This Jupyter Notebook preprocesses the original file with patient data. Mainly, irrelevant rows and columns are removed and the sex variable is recoded.
Glucose_measurements_curation.ipynb
This Jupyter Notebook preprocesses the original file with the continuous glucose level measurements of the patients. Principally, rows without information or duplicated rows are removed, and the variable with the timestamp is transformed into two new variables: measurement date and measurement time.
Biochemical_parameters_curation.ipynb
This Jupyter Notebook preprocesses the original file with the data of the biochemical tests performed on patients to measure their biochemical parameters. Mainly, irrelevant rows and columns are removed and the variable with the name of the measured biochemical parameter is translated.
Diagnostic_curation.ipynb
This Jupyter Notebook preprocesses the original file with the diagnoses of diabetes mellitus complications or other diseases that patients have in addition to T1D.
Get_patient_info_variables.ipynb
This Jupyter Notebook implements the feature extraction process from the files Glucose_measurements.csv, Biochemical_parameters.csv and Diagnostics.csv to complete the file Patient_info.csv. It is divided into six sections: the first three extract the features from each of the mentioned files, and the next three add the extracted features to the resulting new file.
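As an illustration only (not the authors' code), a hedged sketch of deriving some of the Patient_info.csv summary fields from Glucose_measurements.csv:

```python
import pandas as pd

glucose = pd.read_csv("Glucose_measurements.csv")

# Per-patient summary fields mirroring four of the Patient_info.csv variables.
features = glucose.groupby("Patient_ID").agg(
    Initial_measurement_date=("Measurement_date", "min"),
    Final_measurement_date=("Measurement_date", "max"),
    Number_of_days_with_measures=("Measurement_date", "nunique"),
    Number_of_measurements=("Measurement_date", "size"),
).reset_index()

# Write to a separate file rather than overwriting the provided Patient_info.csv.
features.to_csv("patient_info_derived_features.csv", index=False)
```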
Data Usage Agreement
The conditions for use are as follows:
You confirm that you will not attempt to re-identify research participants for any reason, including for re-identification theory research.
You commit to keeping the T1DiabetesGranada dataset confidential and secure and will not redistribute data or Zenodo account credentials.
You will require
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Data from: Rates of Compact Object Coalescence
Brief overview:
This Zenodo entry contains the data that have been used to make the figures for the living review "Rates of Compact Object Coalescence" by Ilya Mandel & Floor Broekgaarden (2021). To reproduce the figures, download all the *.csv files and run the Jupyter notebook created to reproduce the results in the publicly available GitHub repository https://github.com/FloorBroekgaarden/Rates_of_Compact_Object_Coalescence (the exact Jupyter notebook can be found here).
For any suggestions, questions or inquiry, please email one, or both, of the authors:
We very much welcome suggestions for additional/missing literature with rate predictions or measurements.
Extra figures:
Extra figures that can be used can be found here:
Vertical figures: https://docs.google.com/presentation/d/1GqJ0k2zpnxBGwIYNeQ0BfsLSU7H2942gspL-PN_iaJY/edit?usp=sharing
The authors are currently working on making an interactive tool for plotting the rates that will be available soon. In the mean time, feel free to send requests for plots/figures to the authors.
Reference
If you use this data/code for publication, please cite both the paper, Mandel & Broekgaarden (2021) (https://ui.adsabs.harvard.edu/abs/2021arXiv210714239M/abstract), and the dataset on Zenodo through its DOI (see tabs on the right of this Zenodo entry).
Details of the data files:
The PDF COC_rates_supplementary_material.pdf attached (and in the GitHub repository) describes how each of the rates in the data files of this Zenodo entry was retrieved. The other 26 files are CSV files, where each CSV file contains the rates for one specific double compact object type (NS-NS, NS-BH or BH-BH) and one specific rate group (isolated binary evolution, gravitational wave observations, etc.). The files in this entry are:
Each csv file contains the following header:
ADS year # year of the paper in the ADS entry
ADS month # month of the paper in the ADS entry
ADS abstract link # link to the ADS abstract
ArXiv link # link to the ArXiv version of the paper
First Author # name of the first author
label string # label of the study, that corresponds to the label in the figure
code (optional) # name of the code used in this study
type of limit (for plotting, see jupyter notebook for a dictionary) # integer, that is used to map to a certain limit visualization in the plot (e.g. scatter points vs upper limit).
Each entry takes two columns in the CSV files: one for the rates (quoted under the header 'rate [Gpc^-3 yr^-1]') and one for "notes", where we sometimes added notes about the rates (such as whether it is an upper or lower limit).
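A hedged, exploratory sketch of inspecting these files with pandas; it only assumes the 'rate [Gpc^-3 yr^-1]' header mentioned above and makes no further assumptions about the layout, which is documented in COC_rates_supplementary_material.pdf:

```python
import glob

import pandas as pd

# Inspect every rate CSV in this entry (run from the folder holding the downloaded files).
for path in sorted(glob.glob("*.csv")):
    df = pd.read_csv(path)
    # Rates are quoted under 'rate [Gpc^-3 yr^-1]' headers, each paired with a notes column.
    rate_cols = [c for c in df.columns if "rate" in c.lower()]
    print(f"{path}: {df.shape[0]} rows, {len(rate_cols)} rate column(s)")
```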
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This resource contains Jupyter Python notebooks which are intended to be used to learn about the U.S. National Water Model (NWM). These notebooks explore NWM forecasts in various ways. NWM Notebooks 1, 2, and 3, access NWM forecasts directly from the NOAA NOMADS file sharing system. Notebook 4 accesses NWM forecasts from Google Cloud Platform (GCP) storage in addition to NOMADS. A brief summary of what each notebook does is included below:
Notebook 1 (NWM1_Visualization) focuses on visualization. It includes functions for downloading and extracting time series forecasts for any of the 2.7 million stream reaches of the U.S. NWM. It also demonstrates ways to visualize forecasts using Python packages like matplotlib.
Notebook 2 (NWM2_Xarray) explores methods for slicing and dicing NWM NetCDF files using the Python library xarray.
Notebook 3 (NWM3_Subsetting) is focused on subsetting NWM forecasts and NetCDF files for specified reaches and exporting NWM forecast data to CSV files.
Notebook 4 (NWM4_Hydrotools) uses Hydrotools, a new suite of tools for evaluating NWM data, to retrieve NWM forecasts both from NOMADS and from Google Cloud Platform storage where older NWM forecasts are cached. This notebook also briefly covers visualizing, subsetting, and exporting forecasts retrieved with Hydrotools.
NOTE: Notebook 4 requires a newer version of NumPy than is available on the default CUAHSI JupyterHub instance. Please use the instance "HydroLearn - Intelligent Earth" and be sure to run !pip install hydrotools.nwm_client[gcp].
The notebooks are part of a NWM learning module on HydroLearn.org. When the associated learning module is complete, the link to it will be added here. It is recommended that these notebooks be opened through the CUAHSI JupyterHub App on Hydroshare. This can be done via the 'Open With' button at the top of this resource page.
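In the spirit of Notebooks 2 and 3, here is a hedged xarray sketch that opens one NWM channel-route NetCDF file, selects a few reaches and exports them to CSV; the file name and the streamflow/feature_id names are assumptions rather than the notebooks' exact code:

```python
import xarray as xr

# Open one NWM short-range channel-route output file (file name is illustrative).
ds = xr.open_dataset("nwm.t00z.short_range.channel_rt.f001.conus.nc")

# NWM channel output is indexed by reach; take the first few reach IDs as a stand-in
# for the specific reaches of interest.
reaches = ds["feature_id"].values[:3]
subset = ds["streamflow"].sel(feature_id=reaches)

# Export the subset to CSV, as Notebook 3 does for specified reaches.
subset.to_dataframe().to_csv("nwm_streamflow_subset.csv")
```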
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Replication Package
This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".
Requirements
We recommend the following requirements to replicate our study:
Package Structure
We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:
- data-analysis, an R-based container we used to run our data analysis.
- data-collection, a Python container we used to collect Scikit's default arguments and detect them in client applications.
- database, a Postgres container we used to store clients' data, obtained from Grotov et al.
- storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared by both containers.
- docker-compose.yml, the Docker file that configures all containers used in the package.
In the remainder of this document, we describe how to set up each container properly.
Using VSCode to Setup the Package
We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both data-analysis and data-collection containers. This way you can directly access and run each container inside it without any specific configuration.
You first need to set up the containers:
$ cd /replication/package/folder
$ docker-compose build
$ docker-compose up
# Wait docker creating and running all containers
Then, you can open them in Visual Studio Code:
If you want/need a more customized organization, the remainder of this file describes it in detail.
Longest Road: Manual Package Setup
Database Setup
The database container will automatically restore the dump in dump_matroskin.tar in its first launch. To set up and run the container, you should:
Build an image:
$ cd ./database
$ docker build --tag 'dabc-database' .
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
dabc-database latest b6f8af99c90d 50 minutes ago 18.5GB
Create and enter inside the container:
$ docker run -it --name dabc-database-1 dabc-database
$ docker exec -it dabc-database-1 /bin/bash
root# psql -U postgres -h localhost -d jupyter-notebooks
jupyter-notebooks=# \dt
List of relations
Schema | Name | Type | Owner
--------+-------------------+-------+-------
public | Cell | table | root
public | Code_cell | table | root
public | Md_cell | table | root
public | Notebook | table | root
public | Notebook_features | table | root
public | Notebook_metadata | table | root
public | repository | table | root
If you got the tables list as above, your database is properly setup.
It is important to mention that this database extends the one provided by Grotov et al. Basically, we added three columns to the table Notebook_features (API_functions_calls, defined_functions_calls, and other_functions_calls) containing the function calls performed by each client in the database.
Data Collection Setup
This container is responsible for collecting the data to answer our research questions. It has the following structure:
- dabcs.py, extracts DABCs from the Scikit Learn source code and exports them to a CSV file.
- dabcs-clients.py, extracts function calls from clients and exports them to a CSV file. We rely on a modified version of Matroskin to leverage the function calls. You can find the tool's source code in the matroskin directory.
- Makefile, commands to set up and run both dabcs.py and dabcs-clients.py.
- matroskin, the directory containing the modified version of the matroskin tool. We extended the library to collect the function calls performed in the client notebooks of Grotov's dataset.
- storage, a Docker volume where data-collection should save the exported data. This data will be used later in Data Analysis.
- requirements.txt, Python dependencies adopted in this module.
Note that the container will automatically configure this module for you, e.g., install dependencies, configure matroskin, download the Scikit Learn source code, etc. For this, you must run the following commands:
$ cd ./data-collection
$ docker build --tag "data-collection" .
$ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
$ docker exec -it data-collection-1 /bin/bash
$ ls
Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py
If you see project files, it means the container is configured accordingly.
Data Analysis Setup
We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:
- dependencies.R, an R script containing the dependencies used in our data analysis.
- data-analysis.Rmd, the R notebook we used to perform our data analysis.
- datasets, a Docker volume pointing to the storage directory.
Execute the following commands to run this container:
$ cd ./data-analysis
$ docker build --tag "data-analysis" .
$ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-collection/datasets/ data-analysis
$ docker exec -it data-analysis-1 /bin/bash
$ ls
data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile
If you see project files, it means the container is configured accordingly.
A note on storage shared folder
As mentioned, the storage folder is mounted as a volume and shared between the data-collection and data-analysis containers. We compressed the contents of this folder due to space constraints. Therefore, before starting to work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the Makefile inside the storage folder.
$ make unzip # extract files
$ ls
clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv
$ make zip # compress files
$ ls
csv-files.tar.gz Makefile
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/ (license information derived automatically)
Replication Data for: “Formal Models of Opinion Formation and their Application to Real Data: Evidence from Online Social Networks” The repository includes three files: Code.ipynb, X_friends.npz, and X_opinions.csv. The first is a notebook document used by Jupyter Notebook, which stores the Python code to replicate the results from the manuscript. To open this file one needs to install Python and Jupyter. The second file contains the matrix A of users’ friendship connections. The matrix representation is the sparse CSR format. The file X_opinions.csv consists of the matrices X̂ and X̂_- stacked together along the horizontal axis. The last two files keep the data originally introduced in (Kozitsin et al., 2019) and located in the repository https://doi.org/10.7910/DVN/OUZY74. For convenience of usage, we copy it here. References: Kozitsin, I. V., Marchenko, A. M., Goiko, V. L., & Palkin, R. V. (2019). Symmetric Convex Mechanism of Opinion Formation Predicts Directions of Users’ Opinions Trajectories. 1–5.
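For convenience, a hedged sketch of loading these files; splitting X_opinions.csv into two equal halves is an assumption about how the matrices were stacked:

```python
import pandas as pd
from scipy import sparse

# Friendship matrix A, stored in sparse CSR format.
A = sparse.load_npz("X_friends.npz")
print("A:", A.shape, "non-zero entries:", A.nnz)

# Opinion matrices stacked along the horizontal axis; assuming the two matrices
# have the same number of columns, split the array in half.
X = pd.read_csv("X_opinions.csv", header=None).to_numpy()
half = X.shape[1] // 2
X_hat, X_hat_minus = X[:, :half], X[:, half:]
print("X_hat:", X_hat.shape, "X_hat_minus:", X_hat_minus.shape)
```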
Load, wind and solar, prices in hourly resolution. This data package contains different kinds of time series data relevant for power system modelling, namely electricity consumption (load) for 36 European countries as well as wind and solar power generation and capacities, and prices, for a growing subset of countries. The time series become available at different points in time depending on the sources. The data have been downloaded from the sources, resampled and merged into a large CSV file with hourly resolution. Additionally, the data available at a higher resolution (some renewables in-feed, 15 minutes) are provided in a separate file. All data processing is conducted in Python and pandas and has been documented in the Jupyter notebooks linked below.
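A hedged sketch of loading the hourly file with pandas; the file name time_series_60min_singleindex.csv and the utc_timestamp column follow common OPSD packaging and should be treated as assumptions, not a statement about this resource's exact contents:

```python
import pandas as pd

# Load the hourly package CSV with a datetime index (file and column names are assumptions).
ts = pd.read_csv(
    "time_series_60min_singleindex.csv",
    index_col="utc_timestamp",
    parse_dates=["utc_timestamp"],
)

# Example: daily means of every load column present in the file.
load_cols = [c for c in ts.columns if "load" in c.lower()]
print(ts[load_cols].resample("D").mean().head())
```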
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
The Jupyter Notebook shared here determines X and Y indices of the National Water Model grid cells that contain snow telemetry (SNOTEL) sites. It uses two inputs: one CSV file that includes SNOTEL site information and one NetCDF file that is a land surface model output of the NWM reanalysis results. You can open this resource with CUAHSI JupyterHub and run the notebook within the code folder. The output is a CSV file that gives X and Y indices of the National Water Model grid cells associated with each SNOTEL site.
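The notebook itself is the authoritative implementation; the following is only a rough, hedged sketch of the nearest-cell lookup it performs, assuming the SNOTEL table already carries coordinates in the NWM grid's projection and that the NetCDF file exposes 1D x and y coordinates:

```python
import numpy as np
import pandas as pd
import xarray as xr

# SNOTEL site list and one NWM land-surface output file (file and column names are assumptions).
sites = pd.read_csv("snotel_sites.csv")          # assumed columns: site_id, x, y (projected metres)
grid = xr.open_dataset("nwm_ldasout_sample.nc")  # assumed 1D coordinate variables: x, y

# Nearest grid-cell index along each axis for every site.
x_idx = np.abs(grid["x"].values[None, :] - sites["x"].values[:, None]).argmin(axis=1)
y_idx = np.abs(grid["y"].values[None, :] - sites["y"].values[:, None]).argmin(axis=1)

pd.DataFrame({"site_id": sites["site_id"], "x_index": x_idx, "y_index": y_idx}).to_csv(
    "snotel_nwm_grid_indices.csv", index=False
)
```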
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Open-Orca Augmented FLAN Collection is a revolutionary dataset that unlocks new levels of language understanding and machine learning model performance. This dataset was created to support research on natural language processing, machine learning models, and language understanding through leveraging the power of reasoning trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences in a more robust way than ever before, this dataset provides researchers expanded opportunities for furthering the progress of linguistics research. With its unique combination of features including system prompts, questions from users and responses from systems, this dataset opens up exciting possibilities for deeper exploration into the cutting edge concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance - explore Open-Orca Augmented FLAN Collection today!
For more datasets, click here.
This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.
Getting Started: The first step is to download the data set from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or cloud storage space. Once you have downloaded the data set, launch the Jupyter Notebook or Google Colab environment you want to use to work with this data set.
Exploring & Preprocessing Data: To get a better understanding of the features in this dataset, import them into a pandas DataFrame as shown below. You can use other libraries as per your need:

import pandas as pd                                   # library used for importing datasets into Python
df = pd.read_csv('train.csv')                         # imports the train CSV file into a pandas DataFrame
df[['system_prompt','question','response']].head()    # views the top 5 rows of the columns 'system_prompt', 'question', 'response'

After importing, check each feature using basic descriptive statistics, such as pandas value_counts or groupby statements. These give greater clarity over the variables present in each feature. The command below shows counts of each element in the system_prompt column of the train CSV file:

df['system_prompt'].value_counts().head()             # shows the count of each element present in the 'system_prompt' column

Output (example): 'User says hello guys': 587; 'System asks How are you?': 555; 'User says I am doing good': 487; and so on.

Data Transformation: After inspecting and exploring the different features, one may want or need certain changes that best suit their needs from this dataset before training modelling algorithms on it.

Common transformation steps include removing punctuation marks: since punctuation marks may not add any value to computation operations, we can remove them with a regex replacement such as .str.replace('[^A-Za-z ]+', '', regex=True).
- Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
- Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
- Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Datasets:
"Figure1a.csv": scattering intensity of hydrated proteins in Wide-Angle X-ray Scattering for different fluences (in units of photons/second/area).
"Figure1a_inset.csv": scattering intensity of hydrated proteins in Small-Angle X-ray Scattering for different fluences (in units of photons/second/area).
"Figure1b.csv": Intensity autocorrelation functions g2 at momentum transfer Q = 0.08 1/nm for different fluences (in units of photons/second/area).
"Figure1b_inset.csv": decay rate (in second) as a function of the momentum transfer Q (in 1/nm) for different fluences (in units of photons/second/area).
"Figure1c.csv": decay rate (in second) for variable fluence (in photons/second/um^2) at the momentum transfer Q = 0.08 1/nm.
"Figure1d.csv": renormalised intensity autocorrelation functions g2 at momentum transfer Q = 0.08 1/nm for variable fluence (in photons/second/um^2), where the time axis is normalised to the corresponding fluence F by calculating t/(1 + a · F·τ0), where τ0 is the equilibrium time constant extracted by extrapolation to F=0 (from data in "Figure1c.csv)"
"Figure2a.csv": The Wide-Angle X-ray Scattering scattering intensity at different temperatures T=180-290 K
"Figure2b.csv": The Small-Angle X-ray Scattering scattering intensity at different temperatures T=180-290 K
"Figure2c.csv": Intensity autocorrelation functions g2 for different temperatures (T=180-290 K) at momentum transfer Q = 0.1 1/nm.
"Figure2d-2e.csv": time constants (in second) and the Kohlrausch-Williams-Watts (KWW) exponent extracted from the fits of data in "Figure2c.csv" as a function of temperature (in K)
"Figure3b.csv": The normalised variance Chi_T at different temperatures (T=180-290 K) extracted from the two-time correlation functions.
"Figure3c.csv": The maximum of the normalised variance Chi_0 as a function of temperature (in K).
Additionally, a Jupyter notebook "open-data.ipynb" is included, which shows how to load and plot the data from the CSV files in Python.
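A hedged sketch of the kind of loading and plotting open-data.ipynb demonstrates, using Figure1c.csv; the column order is an assumption:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Column layout is an assumption: first column = fluence, second column = decay rate.
fig1c = pd.read_csv("Figure1c.csv")
x, y = fig1c.iloc[:, 0], fig1c.iloc[:, 1]

plt.loglog(x, y, "o")
plt.xlabel("Fluence (photons/second/um^2)")
plt.ylabel("Decay rate")
plt.tight_layout()
plt.savefig("figure1c_reproduction.png", dpi=150)
```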
This dataset is important as it can help users find good-quality videos more easily. The data was collected using the YouTube API and includes a total of _ videos.
Columns: Channel title, view count, like count, comment count, definition, caption, subscribers, total views, average polarity score, label
In order to use this dataset, you will need the following:
- A YouTube API key
- A text editor (e.g. Notepad++, Sublime Text, etc.)
Once you have collected these items, you can begin using the dataset. Here is a step-by-step guide:
1) Navigate to the folder where you saved the dataset.
2) Right-click on the file and select Open with > Your text editor.
3) Copy your YouTube API key and paste it in place of Your_API_Key in line 4 of the code.
4) Save the file and close your text editor.
5) Navigate to the folder in your terminal/command prompt and type jupyter notebook. This will open a Jupyter Notebook in your browser window.
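For orientation, a hedged sketch of the kind of request the collection code makes against the YouTube Data API v3 with google-api-python-client; the video ID is a placeholder and this is not the dataset's actual collection script:

```python
from googleapiclient.discovery import build

API_KEY = "Your_API_Key"  # paste your own key here, as described in step 3 above

# Build a YouTube Data API v3 client and request details for one video.
youtube = build("youtube", "v3", developerKey=API_KEY)
response = youtube.videos().list(
    part="snippet,statistics,contentDetails",
    id="VIDEO_ID",  # placeholder: replace with a real video ID
).execute()

item = response["items"][0]
print(item["snippet"]["title"], item["statistics"].get("viewCount"))
```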
This dataset can be used for a number of different things including: 1. Finding good quality videos on youtube 2. Determining which videos are more likely to be reputable 3. Helping people find videos they will enjoy
The data for this dataset was collected using the Youtube API and includes a total of _ videos
See the dataset description for more information.
File: dataframeclean.csv

| Column name | Description |
|:---|:---|
| channelTitle | |
| viewCount | |
| likeCount | |
| commentCount | |
| definition | |
| caption | |
| subscribers | |
| totalViews | |
| avg polarity score | |
| Label | |
| pushblishYear | |
| durationSecs | |
| tagCount | |
| title length | |
| description length | |
File: ytdataframe.csv

| Column name | Description |
|:---|:---|
| channelTitle | |
| viewCount | |
| likeCount | |
| commentCount | |
| definition | |
| caption | |
| subscribers | |
| totalViews | |
| avg polarity score | |
| Label | |
| title | The title of the video. (String) |
| description | A description of the video. (String) |
| tags | The tags associated with the video. (String) |
| publishedAt | The date and time the video was published. (String) |
| favouriteCount | The number of times the video has been favorited. (Integer) |
| duration | The length of the video in seconds. (Integer) |
File: ytdataframe2.csv

| Column name | Description |
|:---|:---|
| channelTitle | |
| title | The title of the video. (String) |
| description | A description of the video. (String) |
| tags | The tags associated with the video. (String) |
| publishedAt | The date and time the video was published. (String) |
| viewCount | |
| ... | |