32 datasets found
  1. csv file for jupyter notebook

    • figshare.com
    txt
    Updated Nov 21, 2022
    Cite
    Johanna Schultz (2022). csv file for jupyter notebook [Dataset]. http://doi.org/10.6084/m9.figshare.21590175.v1
    Available download formats: txt
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Johanna Schultz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    df_force_kin_filtered.csv is the data sheet used by the DATA3 Python notebook to analyse kinematics and dynamics combined. It contains the footfalls that have data for both kinematics and dynamics. To see how this file is generated, read the first half of the Jupyter notebook.

  2. Update CSV item in ArcGIS

    • anrgeodata.vermont.gov
    Updated Mar 18, 2022
    Cite
    ArcGIS Survey123 (2022). Update CSV item in ArcGIS [Dataset]. https://anrgeodata.vermont.gov/documents/dc69467c3e7243719c9125679bbcee9b
    Dataset updated
    Mar 18, 2022
    Dataset authored and provided by
    ArcGIS Survey123
    Description

    ArcGIS Survey123 utilizes CSV data in several workflows, including external choice lists, the search() appearance, and pulldata() calculations. When you need to periodically update the CSV content used in a survey, a useful method is to upload the CSV files to your ArcGIS organization and link the CSV items to your survey. Once linked, any updates to the CSV items will automatically pull through to your survey without the need to republish the survey. To learn more about linking items to a survey, see Linked content. This notebook demonstrates how to automate updating a CSV item in your ArcGIS organization. Note: It is recommended to run this notebook on your computer in Jupyter Notebook or ArcGIS Pro, as that will provide the best experience when reading locally stored CSV files. If you intend to schedule this notebook in ArcGIS Online or ArcGIS Notebook Server, additional configuration may be required to read CSV files from online file storage, such as Microsoft OneDrive or Google Drive.
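
    Below is a minimal, hedged sketch of that kind of automation using the ArcGIS API for Python; the item ID and CSV path are placeholders rather than values from this item, and the linked notebook's exact logic may differ.

        from arcgis.gis import GIS

        # Sign in using the active ArcGIS Pro / ArcGIS Notebook session credentials.
        gis = GIS("home")

        # Hypothetical item ID of the hosted CSV that the survey links to.
        csv_item = gis.content.get("0123456789abcdef0123456789abcdef")

        # Overwrite the stored CSV with a freshly exported local file (placeholder path).
        csv_item.update(data="updated_choices.csv")
        print("Updated item:", csv_item.title)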

  3. Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Richard Ferrers; Speedtest Global Index (2023). Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus ALC - 2020, 2022 [Dataset]. http://doi.org/10.6084/m9.figshare.13621169.v24
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Speedtest Global Index
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset compares FIXED-line broadband internet speeds for four cities - Melbourne, AU; Bangkok, TH; Shanghai, CN; Los Angeles, US - plus Alice Springs, AU.

    ERRATA:
    1. Data is for Q3 2020, but some files were labelled incorrectly as 02-20 or June 20. They should all read Sept 20 (09-20, i.e. Q3 20) rather than Q2. Will rename and reload; amended in v7.
    2. LAX file was named 0320 when it should be Q320. Amended in v8.

    Lines of data for each geojson file (a line equates to a 600m * 600m location, including total tests, devices used, and average upload and download speed):
    - MEL: 16,181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
    - SHG: 31,745 lines => 0.65M speedtests (2.5/100pp)
    - BKK: 29,296 lines => 1.5M speedtests (14.3/100pp)
    - LAX: 15,899 lines => 1.3M speedtests (10.4/100pp)
    - ALC: 76 lines => 500 speedtests (2/100pp)

    Geojsons of these 2-degree by 2-degree extracts for MEL, BKK and SHG are now added; LAX added in v6, Alice Springs in v15.

    This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.

    ** To do: Will add Google Map versions so everyone can see without installing Jupyter. Link to Google Map (BKK) added below. Key: green > 100Mbps (Superfast), black > 500Mbps (Ultrafast). CSV provided; code in Speedtestv1.1.ipynb Jupyter Notebook. Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100Mbps; suggest plotting the Top 20% on a map for the community. Google Map link now added (and tweet).

    ** Python

        melb = au_tiles.cx[144:146, -39:-37]   # Lat/Lon extract
        shg = tiles.cx[120:122, 30:32]         # Lat/Lon extract
        bkk = tiles.cx[100:102, 13:15]         # Lat/Lon extract
        lax = tiles.cx[-118:-120, 33:35]       # Lat/Lon extract
        ALC = tiles.cx[132:134, -22:-24]       # Lat/Lon extract

    Histograms (v9) and data visualisations (v3, 5, 9, 11) are provided. Data sourced from: this is an extract of Speedtest Open Data available at Amazon WS (link below - opendata.aws).

    ** VERSIONS
    v24. Add tweet and Google Map of Top 20% (over 100Mbps locations) in Melb Q3 22. Add v1.5 MEL-Superfast notebook, and CSV of results (now on Google Map; link below).
    v23. Add graph of 2022 broadband distribution, and compare 2020 vs 2022. Updated v1.4 Jupyter notebook.
    v22. Add import ipynb; workflow-import-4cities.
    v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020: 4.3M tests; 2022: 2.9M tests)
    - Melb: 14,784 lines, avg download speed 69.4 Mbps, 0.39M tests
    - SHG: 31,207 lines, avg 233.7 Mbps, 0.56M tests
    - ALC: 113 lines, avg 51.5 Mbps, 1,092 tests
    - BKK: 29,684 lines, avg 215.9 Mbps, 1.2M tests
    - LAX: 15,505 lines, avg 218.5 Mbps, 0.74M tests
    v20. Speedtest - Five Cities inc ALC.
    v19. Add ALC2.ipynb.
    v18. Add ALC line graph.
    v17. Added ipynb for ALC. Added ALC to title.
    v16. Load Alice Springs data Q2 21 - csv. Added Google Map link of ALC.
    v15. Load Melb Q1 2021 data - csv.
    v14. Added Melb Q1 2021 data - geojson.
    v13. Added Twitter link to pics.
    v12. Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb).
    v11. Add Line-Compare pic, plotting Four Cities on a graph.
    v10. Add Four Histograms in one pic.
    v9. Add Histogram for Four Cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook).
    v8. Renamed LAX file to Q3, rather than 03.
    v7. Amended file names of BKK files to correctly label as Q3, not Q2 or 06.
    v6. Added LAX file.
    v5. Add screenshot of BKK Google Map.
    v4. Add BKK Google map (link below), and BKK csv mapping files.
    v3. Replaced MEL map with big-key version. Previous key was very tiny in top right corner.
    v2. Uploaded MEL, SHG, BKK data and Jupyter Notebook.
    v1. Metadata record.

    ** LICENCE: AWS data licence on Speedtest data is "CC BY-NC-SA 4.0", so use of this data must be: non-commercial (NC); reuse must be share-alike (SA) (add same licence). This restricts the standard CC-BY Figshare licence.

    ** Other uses of Speedtest Open Data: see link at Speedtest below.

  4. Using HydroShare Buckets to Access Resource Files

    • search.dataone.org
    Updated Aug 9, 2025
    + more versions
    Cite
    Pabitra Dash (2025). Using HydroShare Buckets to Access Resource Files [Dataset]. https://search.dataone.org/view/sha256%3Ab25a0f5e5d62530d70ecd6a86f1bd3fa2ab804a8350dc7ba087327839fcb1fb1
    Dataset updated
    Aug 9, 2025
    Dataset provided by
    Hydroshare
    Authors
    Pabitra Dash
    Description

    This resource contains a draft Jupyter Notebook with example code snippets showing how to access HydroShare resource files using HydroShare S3 buckets. The user_account.py file is a utility that reads the user's cached HydroShare account information in any of the JupyterHub instances that HydroShare has access to. The example notebook uses this utility so that you don't have to enter your HydroShare account information in order to access HydroShare buckets.

    Here are the 3 notebooks in this resource:

    • hydroshare_s3_bucket_access_examples.ipynb:

    The above notebook has examples showing how to upload and download resource files from the resource bucket. It also contains examples of how to list the files and folders of a resource in a bucket.

    • python-modules-direct-read-from-bucket/hs_bucket_access_gdal_example.ipynb:

    The above notebook has examples of reading raster and shapefile data from a bucket using GDAL, without the need to download the files from the bucket to local disk.

    • python-modules-direct-read-from-bucket/hs_bucket_access_non_gdal_example.ipynb

    The above notebook has examples of using h5netcdf and xarray to read a NetCDF file directly from a bucket. It also contains examples of using rioxarray to read a raster file and pandas to read a CSV file from HydroShare buckets.
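
    As a rough illustration of this kind of bucket access, here is a minimal sketch using boto3; the endpoint URL, bucket name, resource path, and credentials are placeholders standing in for the values that user_account.py supplies, and the resource's notebooks remain the authoritative examples.

        import boto3

        # Placeholder connection details; in the notebooks these come from user_account.py.
        s3 = boto3.client(
            "s3",
            endpoint_url="https://example-hydroshare-s3-endpoint",
            aws_access_key_id="YOUR_ACCESS_KEY",
            aws_secret_access_key="YOUR_SECRET_KEY",
        )

        bucket = "example-user-bucket"         # placeholder bucket name
        prefix = "resource-id/data/contents/"  # placeholder resource folder

        # List the files and folders of the resource in the bucket.
        for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
            print(obj["Key"])

        # Download one file from the bucket to local disk.
        s3.download_file(bucket, prefix + "example.csv", "example.csv")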

  5. Amazon Web Scrapping Dataset

    • kaggle.com
    zip
    Updated Jun 17, 2023
    Cite
    Mohammad Hurairah (2023). Amazon Web Scrapping Dataset [Dataset]. https://www.kaggle.com/datasets/mohammadhurairah/amazon-web-scrapper-dataset
    Available download formats: zip (2220 bytes)
    Dataset updated
    Jun 17, 2023
    Authors
    Mohammad Hurairah
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Amazon web scraping dataset workflow:
    1. Import libraries
    2. Connect to the website
    3. Import CSV and datetime
    4. Import pandas
    5. Append the dataset to CSV
    6. Automate dataset updates
    7. Set up timers
    8. Email notification
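
    A minimal sketch of steps 1-5 with requests and BeautifulSoup is given below; the product URL, element ids, and header values are placeholders, and the dataset's own notebook may differ in detail.

        import csv
        import datetime

        import requests
        from bs4 import BeautifulSoup

        URL = "https://www.amazon.com/dp/EXAMPLE"  # placeholder product URL
        HEADERS = {"User-Agent": "Mozilla/5.0"}    # placeholder user agent

        def append_price():
            # Connect to the website and parse the page.
            page = requests.get(URL, headers=HEADERS, timeout=10)
            soup = BeautifulSoup(page.content, "html.parser")
            title = soup.find(id="productTitle").get_text(strip=True)         # element id assumed
            price = soup.find(id="priceblock_ourprice").get_text(strip=True)  # element id assumed
            # Append one row per run to the CSV dataset.
            with open("amazon_dataset.csv", "a", newline="", encoding="utf-8") as f:
                csv.writer(f).writerow([datetime.date.today(), title, price])

        # Run once; in the described workflow this is repeated on a timer (step 7),
        # with an email notification on completion (step 8).
        append_price()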

  6. JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at...

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated Feb 11, 2022
    Cite
    Irene Garousi-Nejad; David Tarboton (2022). JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at SNOTEL sites and a Jupyter Notebook to merge/reprocess data [Dataset]. http://doi.org/10.4211/hs.d287f010b2dd48edb0573415a56d47f8
    Available download formats: zip (52.2 KB)
    Dataset updated
    Feb 11, 2022
    Dataset provided by
    HydroShare
    Authors
    Irene Garousi-Nejad; David Tarboton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This JavaScript code has been developed to retrieve NDSI_Snow_Cover from MODIS version 6 for SNOTEL sites using the Google Earth Engine platform. To successfully run the code, you should have a Google Earth Engine account. An input file, called NWM_grid_Western_US_polygons_SNOTEL_ID.zip, is required to run the code. This input file includes 1 km grid cells of the NWM containing SNOTEL sites. You need to upload this input file to the Assets tab in the Google Earth Engine code editor. You also need to import the MOD10A1.006 Terra Snow Cover Daily Global 500m collection into the Google Earth Engine code editor. You may do this by searching for the product name in the search bar of the code editor.

    The JavaScript works for a specified time range. We found that the best period is a month, which is the maximum allowable time range for doing the computation for all SNOTEL sites on Google Earth Engine. The script consists of two main loops. The first loop retrieves data from the first day of a month up to day 28, in five periods. The second loop retrieves data from day 28 to the beginning of the next month. The results are shown as graphs on the right-hand side of the Google Earth Engine code editor under the Console tab. To save results as CSV files, open each time series by clicking on the button located at each graph's top right corner. From the new web page, you can click on the Download CSV button at the top.

    Here is the link to the script path: https://code.earthengine.google.com/?scriptPath=users%2Figarousi%2Fppr2-modis%3AMODIS-monthly

    Then run the Jupyter Notebook (merge_downloaded_csv_files.ipynb) to merge the downloaded CSV files (stored, for example, in a folder called output/from_GEE) into one single CSV file, merged.csv. The Jupyter Notebook then applies some preprocessing steps; the final output is NDSI_FSCA_MODIS_C6.csv.
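
    A minimal sketch of that merge step with pandas is shown below, assuming the downloaded files share compatible columns; the exact preprocessing in merge_downloaded_csv_files.ipynb is not reproduced here.

        import glob

        import pandas as pd

        # Concatenate the per-period CSV files downloaded from Google Earth Engine.
        csv_files = sorted(glob.glob("output/from_GEE/*.csv"))
        merged = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)

        # Write the single combined file used by the later preprocessing steps.
        merged.to_csv("merged.csv", index=False)
        print(f"Merged {len(csv_files)} files into {len(merged)} rows")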

  7. Population Distribution Workflow using Census API in Jupyter Notebook:...

    • openicpsr.org
    delimited
    Updated Jul 23, 2020
    + more versions
    Cite
    Cooper Goodman; Nathanael Rosenheim; Wayne Day; Donghwan Gu; Jayasaree Korukonda (2020). Population Distribution Workflow using Census API in Jupyter Notebook: Dynamic Map of Census Tracts in Boone County, KY, 2000 [Dataset]. http://doi.org/10.3886/E120382V1
    Available download formats: delimited
    Dataset updated
    Jul 23, 2020
    Dataset provided by
    Texas A&M University
    Authors
    Cooper Goodman; Nathanael Rosenheim; Wayne Day; Donghwan Gu; Jayasaree Korukonda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2000
    Area covered
    Boone County
    Description

    This archive reproduces a figure titled "Figure 3.2 Boone County population distribution" from Wang and vom Hofe (2007, p.60). The archive provides a Jupyter Notebook that uses Python and can be run in Google Colaboratory. The workflow uses the Census API to retrieve data, reproduce the figure, and ensure reproducibility for anyone accessing this archive. The Python code was developed in Google Colaboratory (Google Colab for short), which is an Integrated Development Environment (IDE) of JupyterLab and streamlines package installation, code collaboration, and management. The Census API is used to obtain population counts from the 2000 Decennial Census (Summary File 1, 100% data). Shapefiles are downloaded from the TIGER/Line FTP Server. All downloaded data are maintained in the notebook's temporary working directory while in use. The data and shapefiles are stored separately with this archive. The final map is also stored as an HTML file.

    The notebook features extensive explanations, comments, code snippets, and code output. The notebook can be viewed in a PDF format or downloaded and opened in Google Colab. References to external resources are also provided for the various functional components. The notebook features code that performs the following functions:
    - install/import necessary Python packages
    - download the Census Tract shapefile from the TIGER/Line FTP Server
    - download Census data via the Census API
    - manipulate Census tabular data
    - merge Census data with the TIGER/Line shapefile
    - apply a coordinate reference system
    - calculate land area and population density
    - map and export the map to HTML
    - export the map to ESRI shapefile
    - export the table to CSV

    The notebook can be modified to perform the same operations for any county in the United States by changing the State and County FIPS code parameters for the TIGER/Line shapefile and Census API downloads. The notebook can be adapted for use in other environments (i.e., Jupyter Notebook), as well as for reading and writing files to a local or shared drive, or a cloud drive (i.e., Google Drive).
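
    A minimal sketch of the core retrieval-and-join step is shown below; the API endpoint, the variable code P001001 (total population), the shapefile name, the join field, and the projected CRS are assumptions for illustration, and the archived notebook documents the exact calls.

        import geopandas as gpd
        import pandas as pd
        import requests

        # 2000 Decennial Census, Summary File 1: total population per tract in Boone County, KY.
        url = "https://api.census.gov/data/2000/dec/sf1"
        params = {"get": "P001001,NAME", "for": "tract:*", "in": "state:21 county:015"}
        rows = requests.get(url, params=params, timeout=30).json()
        pop = pd.DataFrame(rows[1:], columns=rows[0])
        pop["GEOID"] = pop["state"] + pop["county"] + pop["tract"]
        pop["P001001"] = pop["P001001"].astype(int)

        # Join the counts to a TIGER/Line tract shapefile (file name and join field assumed).
        tracts = gpd.read_file("tl_boone_county_tracts.shp")
        tracts = tracts.merge(pop, left_on="CTIDFP00", right_on="GEOID")

        # Project to a metric CRS (assumed) and compute population density per km^2.
        tracts = tracts.to_crs(epsg=26916)
        tracts["density_km2"] = tracts["P001001"] / (tracts.geometry.area / 1e6)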

  8. Cognitive Fatigue

    • figshare.com
    csv
    Updated Nov 5, 2025
    Cite
    Rui Varandas; Inês Silveira; Hugo Gamboa (2025). Cognitive Fatigue [Dataset]. http://doi.org/10.6084/m9.figshare.28188143.v3
    Available download formats: csv
    Dataset updated
    Nov 5, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rui Varandas; Inês Silveira; Hugo Gamboa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Cognitive Fatigue

    While executing the proposed tasks, the participants' physiological signals were monitored using two biosignalsplux devices from PLUX Wireless Biosignals, Lisbon, Portugal, with a sampling frequency of 100 Hz and a resolution of 16 bits (24 bits in the case of fNIRS). Six different sensors were used: EEG and fNIRS positioned around F7 and F8 of the 10–20 system (the dorsolateral prefrontal cortex is often used to assess CW and fatigue as well as cognitive states); ECG monitored an approximation of Lead I of the Einthoven system; EDA was placed on the palm of the non-dominant hand; ACC was positioned on the right side of the head to measure head movement and overall posture changes; and the RIP sensor was attached to the upper-abdominal area to measure respiration cycles. The combination of the three allows inference about the response of the Autonomic Nervous System (ANS) of the human body, namely the response of the sympathetic and parasympathetic nervous systems.

    2.1. Experimental design

    Cognitive fatigue (CF) is a phenomenon that arises following prolonged engagement in mentally demanding cognitive tasks. Thus, we developed an experimental procedure that involved three demanding tasks: a digital lesson in Jupyter Notebook format, three repetitions of the Corsi-Block task, and two repetitions of a concentration test. Before the Corsi-Block task and after the concentration task there were baseline periods of two minutes. In our analysis, the first baseline period, although not explicitly present in the dataset, was designated as representing no CF, whereas the final baseline period was designated as representing the presence of CF. Between repetitions of the Corsi-Block task, there were baseline periods of 15 s after the task and of 30 s before the beginning of each repetition of the task.

    2.2. Data recording

    A data sample of 10 volunteer participants (4 females) aged between 22 and 48 years old (M = 28.2, SD = 7.6) took part in this study. All volunteers were recruited at NOVA School of Science and Technology, were fluent in English and right-handed, and none reported suffering from psychological disorders or taking regular medication. Written informed consent was obtained before participating, and all ethical procedures approved by the Ethics Committee of NOVA University of Lisbon were thoroughly followed. In this study, we omitted the data from one participant due to the insufficient duration of data acquisition.

    2.3. Data labelling

    The labels easy, difficult, very difficult and repeat found in the ECG_lesson_answers.txt files represent the subjects' opinion of the content read in the ECG lesson. The repeat label represents the most difficult level; it is called repeat because when you press it, the answer to the question is shown again. This system is based on the Anki system, which has been proposed and used to memorise information effectively. In addition, the PB description JSON files include timestamps indicating the start and end of cognitive tasks, baseline periods, and other events, which are useful for defining CF states as described in 2.1.

    2.4. Data description

    Biosignals include EEG, fNIRS (not converted to oxy- and deoxy-Hb), ECG, EDA, respiration (RIP), accelerometer (ACC), and push-button data (PB). All signals have already been converted to physical units. In each biosignal file, the first column corresponds to the timestamps. HCI features encompass keyboard, mouse, and screenshot data.

    Below is a Python code snippet for extracting screenshot files from the screenshots CSV file:

        import base64
        from os import mkdir
        from os.path import join

        file = '...'  # path to the screenshots CSV file
        with open(file, 'r') as f:
            lines = f.readlines()

        mkdir('screenshot')
        for line in lines[1:]:
            timestamp = line.split(',')[0]
            code = line.split(',')[-1][:-2]
            imgdata = base64.b64decode(code)
            filename = str(timestamp) + '.jpeg'
            with open(join('screenshot', filename), 'wb') as f:
                f.write(imgdata)

    A characterization file containing age and gender information for all subjects in each dataset is provided within the respective dataset folder (e.g., D2_subject-info.csv). Other complementary files include (i) descriptions of the push buttons to help segment the signals (e.g., D2_S2_PB_description.json) and (ii) labelling (e.g., D2_S2_ECG_lesson_results.txt). The files D2_Sx_results_corsi-block_board_1.json and D2_Sx_results_corsi-block_board_2.json show the results for the first and second iterations of the Corsi-Block task, where, for example, row_0_1 = 12 means that the subject got 12 pairs right in the first row of the first board, and row_0_2 = 12 means that the subject got 12 pairs right in the first row of the second board.
  9. Data from: Data and code from: Cultivation and dynamic cropping processes...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Dec 2, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Data and code from: Cultivation and dynamic cropping processes impart land-cover heterogeneity within agroecosystems: a metrics-based case study in the Yazoo-Mississippi Delta (USA) [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-cultivation-and-dynamic-cropping-processes-impart-land-cover-heterogene-f5f78
    Dataset updated
    Dec 2, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Area covered
    Mississippi Delta, United States, Mississippi
    Description

    This dataset contains data and code from the manuscript: Heintzman, L.J., McIntyre, N.E., Langendoen, E.J., & Read, Q.D. (2024). Cultivation and dynamic cropping processes impart land-cover heterogeneity within agroecosystems: a metrics-based case study in the Yazoo-Mississippi Delta (USA). Landscape Ecology 39, 29. https://doi.org/10.1007/s10980-024-01797-0

    There are 14 rasters of land use and land cover data for the study region, in .tif format with associated auxiliary files, two shapefiles with county boundaries and study area extent, a CSV file with summary information derived from the rasters, and a Jupyter notebook containing Python code. The rasters included here represent an intermediate data product. Original unprocessed rasters from NASS CropScape are not included here, nor is the code to process them.

    List of files:
    - MS_Delta_maps.zip
      - MSDeltaCounties_UTMZone15N.shp: Depiction of the 19 counties (labeled) that intersect the Mississippi Alluvial Plain in western Mississippi.
      - MS_Delta_MAP_UTMZone15N.shp: Depiction of the study area extent.
    - mf8h_20082021.zip
      - mf8h_XXXX.tif: Yearly, reclassified and majority-filtered LULC data used to build comboall1.csv, derived from USDA NASS CropScape. There are 14 .tif files in total for years 2008-2021. Each .tif file includes auxiliary files with the same file name and the following extensions: .tfw, .tif.aux.xml, .tif.ovr, .tif.vat.cpg, .tif.vat.dbf.
    - comboall1.csv: Combined dataset of LULC information for all 14 years in the study period.
    - analysis.ipynb_.txt: Jupyter Notebook used to analyze comboall1.csv. Convert to .ipynb format to open with Jupyter.

    This research was conducted under USDA Agricultural Research Service, National Program 211 (Water Availability and Watershed Management).

  10. FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1)

    • zenodo.org
    bin, png, zip
    Updated Jul 11, 2024
    Cite
    Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Markus Zdrallek (2024). FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1) [Dataset]. http://doi.org/10.5281/zenodo.8328113
    Available download formats: bin, zip, png
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Markus Zdrallek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # FiN-2 Large-Scale Real-World PLC-Dataset

    ## About
    #### FiN-2 dataset in a nutshell:
    FiN-2 is the first large-scale real-world dataset on data collected in a powerline communication infrastructure. Since the electricity grid is inherently a graph, our dataset could be interpreted as a graph dataset. Therefore, we use the word node to describe points (cable distribution cabinets) of measurement within the low-voltage electricity grid and the word edge to describe connections (cables) in between them. However, since these are PLC connections, an edge does not necessarily have to correspond to a real cable; more on this in our paper.
    FiN-2 shows measurements that relate to the nodes (voltage, total harmonic distortion) as well as to the edges (signal-to-noise ratio spectrum, tonemap). In total, FiN-2 is distributed across three different sites with a total of 1,930,762,116 node measurements each for the individual features and 638,394,025 edge measurements each for all 917 PLC channels. All data was collected over a 25-month period from mid-2020 to the end of 2022.
    We propose this dataset to foster research in the domain of grid automation and smart grids. Therefore, we provide different example use cases in asset management, grid state visualization, forecasting, predictive maintenance, and novelty detection. For more detailed information on this dataset, please see our [paper](https://arxiv.org/abs/2209.12693).

    * * *
    ## Content
    The FiN-2 dataset is split into two compressed `csv` files: *nodes.csv* and *edges.csv*.

    All files are provided as a compressed ZIP file and are divided into four parts. The first part can be found in this repo, while the remaining parts can be found in the following:
    - https://zenodo.org/record/8328105
    - https://zenodo.org/record/8328108
    - https://zenodo.org/record/8328111

    ### Node data

    | id | ts | v1 | v2 | v3 | thd1 | thd2 | thd3 | phase_angle1 | phase_angle2 | phase_angle3 | temp |
    |----|----|----|----|----|----|----|----|----|----|----|----|
    |112|1605530460|236.5|236.4|236.0|2.9|2.5|2.4|120.0|119.8|120.0|35.3|
    |112|1605530520|236.9|236.6|236.6|3.1|2.7|2.5|120.1|119.8|120.0|35.3|
    |112|1605530580|236.2|236.4|236.0|3.1|2.7|2.5|120.0|120.0|119.9|35.5|

    - id / ts: Unique identifier of the node being measured and timestamp of the measurement
    - v1/v2/v3: Voltage measurements of all three phases
    - thd1/thd2/thd3: Total harmonic distortion of all three phases
    - phase_angle1/2/3: Phase angle of all three phases
    - temp: Temperature in-circuit of the sensor inside a cable distribution unit (in °C)

    ### Edge data
    | src | dst | ts | snr0 | snr1 | snr2 | ... | snr916 |
    |----|----|----|----|----|----|----|----|
    |62|94|1605528900|70|72|45|...|-53|
    |62|32|1605529800|16|24|13|...|-51|
    |17|94|1605530700|37|25|24|...|-55|

    - src & dst & ts: Unique identifier of the source and target nodes where the spectrum is measured and time of measurement
    - snr0/snr1/.../snr916: 917 SNR measurements in tenths of a decibel (e.g. 50 --> 5dB).

    ### Metadata
    Metadata that is provided along with the data covers:

    - Number of cable joints
    - Cable properties (length, type, number of sections)
    - Relative position of the nodes (location, zero-centered gps)
    - Adjacent PV or wallbox installations
    - Year of installation w.r.t. the nodes and cables

    Since the electricity grid is part of the critical infrastructure, it is not possible to provide exact GPS locations.

    * * *
    ## Usage
    Simple data access using pandas:

    ```
    import pandas as pd

    nodes_file = "nodes.csv.gz" # /path/to/nodes.csv.gz
    edges_file = "edges.csv.gz" # /path/to/edges.csv.gz

    # read the first 10 rows
    data = pd.read_csv(nodes_file, nrows=10, compression='gzip')

    # read the row number 5 to 15
    data = pd.read_csv(nodes_file, nrows=10, skiprows=[i for i in range(1,6)], compression='gzip')

    # ... same for the edges
    ```

    The compressed csv data format was used to make sharing as easy as possible; however, it comes with significant drawbacks for machine learning. Due to the inherent graph structure, a single snapshot of the whole graph consists of a set of node and edge measurements. But due to timeouts, noise and other disturbances, nodes sometimes fail to collect data, so the number of measurements for a specific timestamp differs. This, plus the high sparsity of the graph, makes the csv format highly inefficient for ML training.
    To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).


    ### Example use case (voltage forecasting)

    Forecasting of the voltage is one potential use case. The Jupyter notebook provided in the repository gives an overview of how the dataset can be loaded, preprocessed and used for ML training. MinMax scaling is used as simple preprocessing, and a PyTorch dataset class is created to handle the data. Furthermore, a vanilla autoencoder is used to process and forecast the voltage into the future.
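
    The following is a minimal sketch of that preprocessing, assuming a single node and illustrative window/horizon sizes; it is not the notebook's exact code.

    ```python
    import pandas as pd
    import torch
    from torch.utils.data import Dataset

    class VoltageWindows(Dataset):
        """Sliding windows of MinMax-scaled voltages for forecasting."""

        def __init__(self, csv_path="nodes.csv.gz", window=60, horizon=10):
            df = pd.read_csv(csv_path, compression="gzip",
                             usecols=["id", "ts", "v1", "v2", "v3"])
            df = df[df["id"] == df["id"].iloc[0]].sort_values("ts")  # one node for simplicity
            v = df[["v1", "v2", "v3"]].to_numpy(dtype="float32")
            self.v = (v - v.min(0)) / (v.max(0) - v.min(0))          # MinMax scaling
            self.window, self.horizon = window, horizon

        def __len__(self):
            return len(self.v) - self.window - self.horizon + 1

        def __getitem__(self, i):
            x = torch.from_numpy(self.v[i:i + self.window])
            y = torch.from_numpy(self.v[i + self.window:i + self.window + self.horizon])
            return x, y
    ```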

  11. Using GeoData in Python

    • kaggle.com
    zip
    Updated Apr 14, 2019
    Cite
    Thomas (2019). Using GeoData in Python [Dataset]. https://www.kaggle.com/thomaskranzkowski/using-geodata-in-python
    Available download formats: zip (4963704 bytes)
    Dataset updated
    Apr 14, 2019
    Authors
    Thomas
    Description

    In this short introduction to using geospatial data in Python, I combine three different types of data sources which can be implemented in one map. For this purpose I start by reading a .csv with random addresses in order to request geo-coordinates from Google using its API and create a new dataframe. I continue by reading a zip folder with data from Natural Earth into Python and geocoding the first dataframe into a geo dataframe with geometry. It is also possible to construct a geodataframe manually with geopandas. Reading geospatial data from GeoJSON then provides more exact polygons of the German districts for plotting them with the previous geo dataframes in a single map.
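
    A minimal sketch of combining the three sources with geopandas is shown below; the file names and sample coordinates are placeholders, and the geocoding step (Google API) is assumed to have already produced latitude/longitude columns.

        import geopandas as gpd
        import pandas as pd
        from shapely.geometry import Point

        # Addresses already geocoded to lat/lon (e.g. via the Google Geocoding API).
        df = pd.DataFrame({"name": ["A", "B"], "lat": [52.52, 48.14], "lon": [13.40, 11.58]})
        points = gpd.GeoDataFrame(
            df, geometry=[Point(xy) for xy in zip(df.lon, df.lat)], crs="EPSG:4326"
        )

        # Natural Earth layer (downloaded zip) and German districts from GeoJSON (placeholders).
        countries = gpd.read_file("ne_110m_admin_0_countries.zip")
        districts = gpd.read_file("german_districts.geojson")

        # Plot all three sources in one map.
        ax = countries[countries["NAME"] == "Germany"].plot(color="lightgrey")
        districts.plot(ax=ax, edgecolor="white", linewidth=0.3)
        points.plot(ax=ax, color="red", markersize=20)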

    In a second Jupyter notebook I continue with agglomerative and K-Means clustering of the GDP-per-capita data by manipulating the Natural Earth data sheet.

    In a following project I plan to start with SVM algorithms on these geo data.

    view file "Using Geo Data in Python": https://bit.ly/2SN3oTl

    view file "Agglomerative and Kmeans Clustering": https://bit.ly/2SN3D0H

  12. Data Visualization of Weight Sensor and Event Detection of Aifi Store

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    João Diogo Falcão; Carlos Ruiz; Rahul S Hoskeri; Adeola Bannis; Shijia Pan; Hae Young Noh; Pei Zhang (2024). Data Visualization of Weight Sensor and Event Detection of Aifi Store [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4292483
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    AiFi Inc.
    Stanford University
    University of California, Merced
    Carnegie Mellon University
    Authors
    João Diogo Falcão; Carlos Ruiz; Rahul S Hoskeri; Adeola Bannis; Shijia Pan; Hae Young Noh; Pei Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AiFi Store is an autonomous store for a cashier-less shopping experience, achieved by multi-modal sensing (vision, weight, and location modalities). AiFi Nano store layout (Fig 1) (image credits: AIM3S research paper).

    Overview: The store is organized into gondolas; each gondola has shelves that hold the products, and each shelf has weight sensor plates. The weight sensor plate data are used to find the event trigger (pick up, put down, or no event), from which we can find the weight of the product picked.

    A gondola is similar to a vertical fixture consisting of horizontal shelves in any normal store; in this case there are 5 to 6 shelves in a gondola. Every shelf is in turn composed of weight sensing plates, with around 12 plates on each shelf.

    Every plate has a sampling rate of 60 Hz, so 60 samples are collected every second from each plate.

    A pick-up event on a plate can be observed and marked when the weight sensor reading decreases with time; the reading increases with time when a put-down event happens.

    Event Detection:

    An event is said to be detected if the moving variance calculated from the raw weight sensor reading exceeds a set threshold (10,000 g^2 or 0.01 kg^2) over a sliding window of 0.5 seconds, which is half of the 1-second interval used for detection.
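
    A minimal sketch of this detection rule with pandas is shown below; the file name is a placeholder, and whether the weight files carry a header row is an assumption (the five-column layout is described later in this entry).

        import pandas as pd

        # Placeholder file; columns follow the layout timestamp, reading (g), gondola, shelf, plate.
        df = pd.read_csv("weight_example.csv",
                         names=["timestamp", "reading", "gondola", "shelf", "plate"])

        WINDOW = 30          # 0.5 s at a 60 Hz sampling rate
        THRESHOLD = 10_000   # g^2 (0.01 kg^2)

        # Moving variance of the raw reading over the sliding window, compared to the threshold.
        df["moving_var"] = df["reading"].rolling(WINDOW).var()
        df["event"] = df["moving_var"] > THRESHOLD
        print(df.loc[df["event"], ["timestamp", "gondola", "shelf", "plate"]].head())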

    There are 3 types of events:

    Pick Up Event (Fig 2) = Object being taken from the particular gondola and shelf by the customer

    Put Down Event (Fig 3) = Object being placed back by the customer on that particular gondola and shelf

    No Event (Fig 4) = No object being picked up from that shelf

    NOTE:

    1. The Python script must be in the same folder as the weight .csv files, and the .csv files should not be placed in other subdirectories.

    2. The videos for the corresponding weight sensor data can be found in the "Videos" folder in the repository and are named similarly to their corresponding .csv files.

    3. Each video file consists of video data from 13 different camera angles.

    Details of the weight sensor files:

    These weight.csv files (baseline cases and team-specific cases) are from the AIFI CPS IoT 2020 week. There are over 50 cases in total, and each file has 5 columns (Fig 5): timestamp, reading (in grams), gondola, shelf, and plate number.

    Each of these files has around 2-5 minutes of data, indexed by timestamp. To unpack date and time from a timestamp, use the datetime module from Python.

    Details of the product.csv files:

    There are product.csv files for each test case; these files provide detailed information about the product name, the product location in the store (gondola number, shelf number, and plate number), the product weight (in grams), and a link to an image of the product.

    Instruction to run the script:

    To analyse the weight.csv files using the Python script and plot the time series for the corresponding files:

    Download the dataset.

    Make sure the Python/Jupyter notebook file is in the same directory as the .csv files.

    Install the requirements: $ pip3 install -r requirements.txt

    Run the Python script Plot.py: $ python3 Plot.py

    After the script has run successfully, you will find folders corresponding to the weight.csv files containing the figures (weight vs. timestamp) named in the format gondola_number,shelf_number.png, e.g. 1,1.png (Fig 4) (time series graph).

    Instruction to run the Jupyter Notebook:

    Run the Plot.ipynb file using Jupyter Notebook by placing the .csv files in the same directory as the Plot.ipynb script.
  13. Data from: Multi-task Deep Learning for Water Temperature and Streamflow...

    • catalog.data.gov
    Updated Nov 11, 2025
    Cite
    U.S. Geological Survey (2025). Multi-task Deep Learning for Water Temperature and Streamflow Prediction (ver. 1.1, June 2022) [Dataset]. https://catalog.data.gov/dataset/multi-task-deep-learning-for-water-temperature-and-streamflow-prediction-ver-1-1-june-2022
    Dataset updated
    Nov 11, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This item contains data and code used in experiments that produced the results for Sadler et al. (2022) (see below for full reference). We ran five experiments for the analysis: Experiment A, Experiment B, Experiment C, Experiment D, and Experiment AuxIn. Experiment A tested multi-task learning for predicting streamflow with 25 years of training data and using a different model for each of 101 sites. Experiment B tested multi-task learning for predicting streamflow with 25 years of training data and using a single model for all 101 sites. Experiment C tested multi-task learning for predicting streamflow with just 2 years of training data. Experiment D tested multi-task learning for predicting water temperature with over 25 years of training data. Experiment AuxIn used water temperature as an input variable for predicting streamflow. These experiments and their results are described in detail in the WRR paper. Data from a total of 101 sites across the US were used for the experiments. The model input data and streamflow data were from the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset (Newman et al. 2014, Addor et al. 2017). The water temperature data were gathered from the National Water Information System (NWIS) (U.S. Geological Survey, 2016). The contents of this item are broken into 13 files or groups of files aggregated into zip files:

    1. input_data_processing.zip: A zip file containing the scripts used to collate the observations, input weather drivers, and catchment attributes for the multi-task modeling experiments
    2. flow_observations.zip: A zip file containing collated daily streamflow data for the sites used in multi-task modeling experiments. The streamflow data were originally accessed from the CAMELS dataset. The data are stored in csv and Zarr formats (a minimal read sketch is shown after this list).
    3. temperature_observations.zip: A zip file containing collated daily water temperature data for the sites used in multi-task modeling experiments. The data were originally accessed via NWIS. The data are stored in csv and Zarr formats.
    4. temperature_sites.geojson: Geojson file of the locations of the water temperature and streamflow sites used in the analysis.
    5. model_drivers.zip: A zip file containing the daily input weather driver data for the multi-task deep learning models. These data are from the Daymet drivers and were collated from the CAMELS dataset. The data are stored in csv and Zarr formats.
    6. catchment_attrs.csv: Catchment attributes collated from the CAMELS dataset. These data are used for the Random Forest modeling. For full metadata regarding these data, see the CAMELS dataset.
    7. experiment_workflow_files.zip: A zip file containing workflow definitions used to run multi-task deep learning experiments. These are Snakemake workflows. To run a given experiment, one would run (for experiment A) 'snakemake -s expA_Snakefile --configfile expA_config.yml'
    8. river-dl-paper_v0.zip: A zip file containing python code used to run multi-task deep learning experiments. This code was called by the Snakemake workflows contained in 'experiment_workflow_files.zip'.
    9. random_forest_scripts.zip: A zip file containing Python code and a Python Jupyter Notebook used to prepare data for, train, and visualize feature importance of a Random Forest model.
    10. plotting_code.zip: A zip file containing python code and Snakemake workflow used to produce figures showing the results of multi-task deep learning experiments.
    11. results.zip: A zip file containing results of multi-task deep learning experiments. The results are stored in csv and netcdf formats. The netcdf files were used by the plotting libraries in 'plotting_code.zip'. These files are for five experiments, 'A', 'B', 'C', 'D', and 'AuxIn'. These experiment names are shown in the file name.
    12. sample_scripts.zip: A zip file containing scripts for creating sample output to demonstrate how the modeling workflow was executed.
    13. sample_output.zip: A zip file containing sample output data. Similar files are created by running the sample scripts provided.
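
    A minimal sketch of opening the collated observation files (items 2, 3 and 5 above, stored in csv and Zarr formats) is given below; the paths and layout inside the zip archives are assumptions.

        import pandas as pd
        import xarray as xr

        # Collated daily streamflow observations, stored in both csv and Zarr formats.
        flow_csv = pd.read_csv("flow_observations/obs_flow.csv")     # placeholder path
        flow_zarr = xr.open_zarr("flow_observations/obs_flow.zarr")  # placeholder path
        print(flow_csv.head())
        print(flow_zarr)
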
    A. Newman; K. Sampson; M. P. Clark; A. Bock; R. J. Viger; D. Blodgett, 2014. A large-sample watershed-scale hydrometeorological dataset for the contiguous USA. Boulder, CO: UCAR/NCAR. https://dx.doi.org/10.5065/D6MW2F4D

    N. Addor, A. Newman, M. Mizukami, and M. P. Clark, 2017. Catchment attributes for large-sample studies. Boulder, CO: UCAR/NCAR. https://doi.org/10.5065/D6G73C3Q

    Sadler, J. M., Appling, A. P., Read, J. S., Oliver, S. K., Jia, X., Zwart, J. A., & Kumar, V. (2022). Multi-Task Deep Learning of Daily Streamflow and Water Temperature. Water Resources Research, 58(4), e2021WR030138. https://doi.org/10.1029/2021WR030138

    U.S. Geological Survey, 2016, National Water Information System data available on the World Wide Web (USGS Water Data for the Nation), accessed Dec. 2020.

  14. AU Mic b Samples

    • figshare.com
    application/x-gzip
    Updated Mar 10, 2020
    Cite
    Thomas Barclay (2020). AU Mic b Samples [Dataset]. http://doi.org/10.6084/m9.figshare.11314118.v1
    Available download formats: application/x-gzip
    Dataset updated
    Mar 10, 2020
    Dataset provided by
    figshare
    Authors
    Thomas Barclay
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains data needed to recreate figures from the AU Mic b discovery paper. Files and descriptions:

    Figure 1, top panel:
    - F1_a.csv: two columns of TESS data, covering the first transit (green): time, flux
    - F1_b.csv: two columns of TESS data, covering the second transit (red): time, flux
    - F1_c.csv: transit model for TESS data (orange data): time, model median, model 5th percentile, model 95th percentile
    - F1_d.csv: two columns of Spitzer data (purple dots): time, flux
    - F1_e.csv: transit model for Spitzer data (orange data): time, model median, model 5th percentile, model 95th percentile

    Figure 1, lower panel:
    - F1_f.csv: two columns of TESS data, covering the candidate planet transit (green): time, flux
    - F1_g.csv: transit model for TESS data (orange data): time, model median, model 5th percentile, model 95th percentile

    Extended Data Figure 2, top two panels:
    - ED2_a.csv: two columns of TESS data (black dots): time, flux
    - ED2_b.csv: transit model (orange data): time, model median, model 5th percentile, model 95th percentile
    - ED2_c.csv: GP model (green data): time, model median, model 5th percentile, model 95th percentile
    - ED2_d.csv: combined model (red data): time, model median, model 5th percentile, model 95th percentile

    Extended Data Figure 2, third panel:
    - ED2_e.csv: two columns of Spitzer data (black dots): time, flux
    - ED2_f.csv: transit model (orange data): time, model median, model 5th percentile, model 95th percentile
    - ED2_g.csv: GP model (green data): time, model median, model 5th percentile, model 95th percentile
    - ED2_h.csv: combined model (red data): time, model median, model 5th percentile, model 95th percentile

    Extended Data Figure 2, lower panel:
    - ED2_i.csv: two columns of TESS data (black dots): time, flux
    - ED2_j.csv: transit model (orange data): time, model median, model 5th percentile, model 95th percentile
    - ED2_k.csv: GP model (green data): time, model median, model 5th percentile, model 95th percentile
    - ED2_l.csv: combined model (red data): time, model median, model 5th percentile, model 95th percentile

    Extended Data Figure 3:
    Samples from the MCMC model of AU Mic b. Samples are stored in a PyMC3 trace file called aumicb_pymc3.tgz. The file will need to be untarred first; you can use tar -xzvf aumicb_pymc3.tgz. This is a custom data format for PyMC3 traces: each chain goes inside a directory, and each directory contains a metadata JSON file and a NumPy compressed file. The file can be read using the sample code supplied. The code is in a Jupyter notebook called AU_Mic_read_samples.ipynb. The full notebook will need to be run because the samples rely on the model being set up correctly. Several Python packages are needed to run the notebook: numpy, matplotlib, lightkurve, exoplanet, pymc3, theano, scipy, corner, pandas, and astropy.

  15. Speedtest Open Data - Australia(NZ) 2020-2025; Q220 - Q325 extract by Qtr

    • figshare.com
    txt
    Updated Oct 24, 2025
    + more versions
    Cite
    Richard Ferrers; Speedtest Global Index (2025). Speedtest Open Data - Australia(NZ) 2020-2025; Q220 - Q325 extract by Qtr [Dataset]. http://doi.org/10.6084/m9.figshare.13370504.v43
    Available download formats: txt
    Dataset updated
    Oct 24, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Speedtest Global Index
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New Zealand, Australia
    Description

    This is an Australian extract of Speedtest Open data available at Amazon WS (link below - opendata.aws). AWS data licence is "CC BY-NC-SA 4.0", so use of this data must be: non-commercial (NC); reuse must be share-alike (SA) (add same licence). This restricts the standard CC-BY Figshare licence.

    A world speedtest open data file was downloaded (>400Mb, 7M lines of data). An extract of Australia's locations (lat, long) revealed 88,000 lines of data (attached as csv). A Jupyter notebook of the extract process is attached. See Binder version at Github - https://github.com/areff2000/speedtestAU.
    +> Install: 173 packages | Downgrade: 1 package | Total download: 432MB. Build container time: approx - load time 25 secs.
    => Error: Times out - BUT UNABLE TO LOAD GLOBAL DATA FILE (6.6M lines).
    => Error: Overflows the 8GB RAM container provided with the global data file (3GB).
    => On local JupyterLab (M2 MBP): loads in 6 mins.
    Added Binder from ARDC service: https://binderhub.rc.nectar.org.au
    Docs: https://ardc.edu.au/resource/fair-for-jupyter-notebooks-a-practical-guide/

    A link to a Twitter thread of outputs is provided. A link to a Data tutorial is provided (GitHub), including a Jupyter Notebook to analyse World Speedtest data, selecting one US state.

    Data shows (Q220):
    - 3.1M speedtests | 762,000 devices
    - 88,000 grid locations (600m * 600m), summarised as a point
    - average speed 33.7 Mbps (down), 12.4 Mbps (up) | max speed 724 Mbps
    - data is for 600m * 600m grids, showing average speed up/down, number of tests, and number of users (IP). Added centroid, and now lat/long. See tweet of image of centroids, also attached.

    NB: Discrepancy Q2-21: Speedtest Global shows the June AU average speedtest at 80Mbps, whereas the Q2 mean is 52Mbps (v17; Q1 45Mbps; v14). Dec 20 Speedtest Global has AU at 59Mbps. Could be a timing difference, or spatial anonymising masking/shaping the highest speeds, or the data could be inconsistent between the national average and the geospatial detail. Check in upcoming quarters.

    Next steps: Histogram - compare Q220, Q121, Q122, per v1.4.ipynb.

    Versions:
    v43. Added revised NZ vs AUS graph for Q325 (NZ: Q2 25) since NZ was available from Github (link below). Calculated using the PlayNZ.ipynb notebook. See images on Twitter - https://x.com/ValueMgmt/status/1981607615496122814
    v42. Added AUS Q325 (97.6k lines, avg d/l 165.5 Mbps, median d/l 150.8 Mbps, u/l 28.08 Mbps). Imported using v2 Jupyter notebook (MBP 16Gb). Mean tests: 24.5. Mean devices: 6.02. Download, extract and publish time: UNK - not measured. Download avg is double Q423. Noting NBN increased d/l speeds from Sept '25: 100 -> 500, 250 -> 750. For 1Gbps, upload speed only increased from 50Mbps to 100Mbps. New 2Gbps services introduced on FTTP and HFC networks.
    v41. Added AUS Q225 (96k lines, avg d/l 130.5 Mbps, median d/l 108.4 Mbps, u/l 22.45 Mbps). Imported using v2 Jupyter notebook (MBP 16Gb). Mean tests: 17.2. Mean devices: 5.11. Download, extract and publish: 20 mins. Download avg is double Q422.
    v40. Added AUS Q125 (93k lines, avg d/l 116.6 Mbps, u/l 21.35 Mbps). Imported using v2 Jupyter notebook (MBP 16Gb). Mean tests: 16.9. Mean devices: 5.13. Download, extract and publish: 14 mins.
    v39. Added AUS Q424 (95k lines, avg d/l 110.9 Mbps, u/l 21.02 Mbps). Imported using v2 Jupyter notebook (MBP 16Gb). Mean tests: 17.2. Mean devices: 5.24. Download, extract and publish: 14 mins.
    v38. Added AUS Q324 (92k lines, avg d/l 107.0 Mbps, u/l 20.79 Mbps). Imported using v2 Jupyter notebook (iMac 32Gb). Mean tests: 17.7. Mean devices: 5.33. Added github speedtest-workflow-importv2vis.ipynb Jupyter; added datavis code to colour-code the national map (per Binder on Github; link below).
    v37. Added AUS Q224 (91k lines, avg d/l 97.40 Mbps, u/l 19.88 Mbps). Imported using the speedtest-workflow-importv2 Jupyter notebook. Mean tests: 18.1. Mean devices: 5.4.
    v36. Loaded UK data, Q1-23, and compared to AUS and NZ Q123 data. Added compare image (au-nz-ukQ123.png), calculated with PlayNZUK.ipynb, data loaded with import-UK.ipynb. UK data is a bit rough and ready as it uses a rectangle to mark out the UK, which includes some EIRE and FR; indicative only, and to be definitive it needs a geo-clean to exclude neighbouring countries.
    v35. Loaded Melb geo-maps of speed quartiles (0-25, 25-50, 50-75, 75-100, 100-). Avg in 2020: 41Mbps. Avg in 2023: 86Mbps. MelbQ323.png, MelbQ320.png. Calculated with Speedtest-incHist.ipynb code. Needed to install conda mapclassify. ax=melb.plot(column=...dict(bins[25,50,75,100]))
    v34. Added AUS Q124 (93k lines, avg d/l 87.00 Mbps, u/l 18.86 Mbps). Imported using the speedtest-workflow-importv2 Jupyter notebook. Mean tests: 18.3. Mean devices: 5.5.
    v33. Added AUS Q423 (92k lines, avg d/l 82.62 Mbps). Imported using the speedtest-workflow-importv2 Jupyter notebook. Mean tests: 18.0. Mean devices: 5.6. Added link to Github.
    v32. Recalculated AU vs NZ for upload performance; added image, using the PlayNZ Jupyter notebook. NZ approx 40% of locations at or above 100Mbps. Aus

  16. Blog-1K

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Dec 21, 2022
    Cite
    Haining Wang (2022). Blog-1K [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7455622
    Dataset updated
    Dec 21, 2022
    Dataset provided by
    Indiana University Bloomington
    Authors
    Haining Wang
    License

    ISC License: https://www.isc.org/downloads/software-support-policy/isc-license/

    Description

    The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ posts, and a pre-defined data split (train/dev/test proportional to ca. 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.

    1. Preprocessing

    We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria:
    - cumulatively at least 10,000 characters,
    - cumulatively at most 49,410 characters,
    - cumulatively at least 16 posts,
    - cumulatively at most 40 posts, and
    - each text has at least 50 function words found in the Koppel512 list (to filter out non-English prose).

    Blog-1K has three columns: 'id', 'text', and 'split', where 'id' corresponds to its parent corpus.

    2. Statistics

    Its creation and statistics can be found in the Jupyter Notebook.

    | Split      | # Authors | # Posts | # Characters | Avg. Characters Per Author (Std.) | Avg. Characters Per Post (Std.) |
    |------------|-----------|---------|--------------|-----------------------------------|---------------------------------|
    | Train      | 1,000     | 16,132  | 30,092,057   | 30,092 (5,884)                    | 1,865 (1,007)                   |
    | Validation | 935       | 2,017   | 3,755,362    | 4,016 (2,269)                     | 1,862 (999)                     |
    | Test       | 924       | 2,017   | 3,732,448    | 4,039 (2,188)                     | 1,850 (936)                     |
    3. Usage

    import pandas as pd

    df = pd.read_csv('blog1000.csv.gz', compression='infer')

    # read in training data
    train_text, train_label = zip(*df.loc[df.split=='train'][['text', 'id']].itertuples(index=False))

    4. License

    All the materials are licensed under the ISC License.

    5. Contact

    Please contact its maintainer for questions.

  17. Articles metadata from CrossRef

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Kea Kohv (2025). Articles metadata from CrossRef [Dataset]. https://www.kaggle.com/datasets/keakohv/articles-doi-metadata
    Available download formats: zip (72982417 bytes)
    Dataset updated
    Aug 1, 2025
    Authors
    Kea Kohv
    Description

    This data originates from the Crossref API. It has metadata on the articles contained in the Data Citation Corpus where the dataset of the citation pair is identified by a DOI.

    How to recreate this dataset in Jupyter Notebook:

    1) Prepare the list of articles to query

    ```python
    import pandas as pd

    # See: https://www.kaggle.com/datasets/keakohv/data-citation-coprus-v4-1-eupmc-and-datacite
    CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"

    # Load the citation pairs from the Parquet file
    citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

    # Remove all rows where "https" is in the 'dataset' column but no "doi.org" is present
    citation_pairs = citation_pairs[
        ~((citation_pairs['dataset'].str.contains("https")) & (~citation_pairs['dataset'].str.contains("doi.org")))
    ]

    # Remove all rows where figshare is in the dataset name
    citation_pairs = citation_pairs[~citation_pairs['dataset'].str.contains("figshare")]

    citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
    citation_pairs_doi = citation_pairs[citation_pairs['is_doi'] == True].copy()

    articles = list(set(citation_pairs_doi['publication'].to_list()))
    articles = [doi.replace("_", "/") for doi in articles]

    # Save the list of articles to a file
    with open("articles.txt", "w") as f:
        for article in articles:
            f.write(f"{article}\n")
    ```

    2) Query articles from CrossRef API

    
    %%writefile enrich.py
    #!pip install -q aiolimiter
    import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
    from aiolimiter import AsyncLimiter
    
    # ---------- config ----------
    HEADERS  = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"} # Put your email here
    MAX_RPS  = 45           # polite pool limit (50), leave head-room
    BATCH_SIZE = 10_000         # rows per INSERT
    DB_PATH  = pathlib.Path("crossref.sqlite").resolve()
    ARTICLES  = pathlib.Path("articles.txt")
    # -----------------------------
    
    # ---- platform tweak: prefer selector loop on Windows ----
    if sys.platform == "win32":
      asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    
    # ---- read the DOI list ----
    with ARTICLES.open(encoding="utf-8") as f:
      DOIS = [line.strip() for line in f if line.strip()]
    
    # ---- make sure DB & table exist BEFORE the async part ----
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(DB_PATH) as db:
      db.execute("""
        CREATE TABLE IF NOT EXISTS works (
          doi  TEXT PRIMARY KEY,
          json TEXT
        )
      """)
      db.execute("PRAGMA journal_mode=WAL;")   # better concurrency
    
    # ---------- async section ----------
    limiter = AsyncLimiter(MAX_RPS, 1)       # 45 req / second
    sem   = asyncio.Semaphore(100)        # cap overall concurrency
    
    async def fetch_one(session, doi: str):
      url = f"https://api.crossref.org/works/{doi}"
      async with limiter, sem:
        try:
          async with session.get(url, headers=HEADERS, timeout=10) as r:
            if r.status == 404:         # common “not found”
              return doi, None
            r.raise_for_status()        # propagate other 4xx/5xx
            return doi, await r.json()
        except Exception as e:
          return doi, None            # log later, don’t crash
    
    async def main():
      start = time.perf_counter()
      db  = sqlite3.connect(DB_PATH)        # KEEP ONE connection
      db.execute("PRAGMA synchronous = NORMAL;")   # speed tweak
    
      async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
        for chunk_start in range(0, len(DOIS), BATCH_SIZE):
          slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
          tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
          results = await asyncio.gather(*tasks)    # all tuples, no exc
    
          good_rows, bad_dois = [], []
          for doi, payload in results:
            if payload is None:
              bad_dois.append(doi)
            else:
              good_rows.append((doi, orjson.dumps(payload).decode()))
    
          if good_rows:
            db.executemany(
              "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
              good_rows,
            )
            db.commit()
    
          if bad_dois:                # append for later retry
            with open("failures.log", "a", encoding="utf-8") as fh:
              fh.writelines(f"{d}\n" for d in bad_dois)
    
          done = chunk_start + len(slice_)
          rate = done / (time.perf_counter() - start)
          print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")
    
      db.close()
    
    if __name__ == "__main__":
      asyncio.run(main())
    

    Then run the script from a notebook cell with !python enrich.py (or python enrich.py from a terminal).

    3) Finally extract the necessary fields

    import sqlite3
    import orjson
    i...
    
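    The snippet above is truncated. A minimal sketch of what the extraction step could look like is shown below; the chosen fields and the output file name are assumptions rather than the exact code used to build this dataset, and they follow the field names of the Crossref works response (title, container-title, publisher, type, issued).

    
    import sqlite3
    import orjson
    import pandas as pd
    
    rows = []
    with sqlite3.connect("crossref.sqlite") as db:
        # enrich.py stores one raw Crossref response per DOI in the works table
        for doi, payload in db.execute("SELECT doi, json FROM works"):
            msg = orjson.loads(payload).get("message", {})
            rows.append({
                "doi": doi,
                "title": (msg.get("title") or [None])[0],
                "container_title": (msg.get("container-title") or [None])[0],
                "publisher": msg.get("publisher"),
                "type": msg.get("type"),
                "issued_year": (msg.get("issued", {}).get("date-parts") or [[None]])[0][0],
            })
    
    # Flatten into a table and save it
    pd.DataFrame(rows).to_csv("articles_metadata.csv", index=False)
    
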
  18. The Cultural Resource Curse: How Trade Dependence Undermines Creative...

    • zenodo.org
    bin, csv
    Updated Aug 9, 2025
    Cite
    Anon Anon; Anon Anon (2025). The Cultural Resource Curse: How Trade Dependence Undermines Creative Industries [Dataset]. http://doi.org/10.5281/zenodo.16784974
    Explore at:
    csv, binAvailable download formats
    Dataset updated
    Aug 9, 2025
    Dataset provided by
    Zenodo
    Authors
    Anon Anon; Anon Anon
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the study The Cultural Resource Curse: How Trade Dependence Undermines Creative Industries. It contains country-year panel data for 2000–2023 covering both OECD economies and the ten largest Latin American countries by land area. Variables include GDP per capita (constant PPP, USD), trade openness, internet penetration, education indicators, cultural exports per capita, and executive constraints from the Polity V dataset.

    The dataset supports a comparative analysis of how economic structure, institutional quality, and infrastructure shape cultural export performance across development contexts. Within-country fixed effects models show that trade openness constrains cultural exports in OECD economies but has no measurable effect in resource-dependent Latin America. In contrast, strong executive constraints benefit cultural industries in advanced economies while constraining them in extraction-oriented systems. The results provide empirical evidence for a two-stage development framework in which colonial extraction legacies create distinct constraints on creative industry growth.
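
    A compact way to read the fixed effects claim above is as a two-way fixed effects panel specification (this is only a sketch of the implied form; the exact model is defined in the included notebooks):

    $$\text{CulturalExports}_{it} = \beta\,\text{TradeOpenness}_{it} + \gamma' X_{it} + \alpha_i + \lambda_t + \varepsilon_{it}$$

    where $\alpha_i$ and $\lambda_t$ are country and year fixed effects and $X_{it}$ collects the remaining controls (GDP per capita, internet penetration, education, executive constraints).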

    All variables are harmonized to ISO3 country codes and aligned on a common panel structure. The dataset is fully reproducible using the included Jupyter notebooks (OECD.ipynb, LATAM+OECD.ipynb, cervantes.ipynb).

    Contents:

    • GDPPC.csv — GDP per capita series from the World Bank.

    • explanatory.csv — Trade openness, internet penetration, and education indicators.

    • culture_exports.csv — UNESCO cultural export data.

    • p5v2018.csv — Polity V institutional indicators.

    • Jupyter notebooks for data processing and replication.

    Potential uses: Comparative political economy, cultural economics, institutional development, and resource curse research.

    How to Run This Dataset and Code in Google Colab

    These steps reproduce the OECD vs. Latin America analyses from the paper using the provided CSVs and notebooks.

    1) Open Colab and set up

    1. Go to https://colab.research.google.com

    2. Click File → New notebook.

    3. (Optional) If your files are in Google Drive, mount it:

    from google.colab import drive
    drive.mount('/content/drive')

    2) Get the data files into Colab

    You have two easy options:

    A. Upload the 4 CSVs + notebooks directly

    • In the left sidebar, click the folder icon → Upload.

    • Upload: GDPPC.csv, explanatory.csv, culture_exports.csv, p5v2018.csv, and any .ipynb you want to run.

    B. Use Google Drive

    • Put those files in a Drive folder.

    • After mounting Drive, refer to them with paths like /content/drive/MyDrive/your_folder/GDPPC.csv.
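
    With the files in place, the panel can be assembled in a few lines of pandas. The sketch below is illustrative only: the join keys ("iso3", "year") and the upload location are assumptions, so check the actual column names in the CSVs before running.

    
    import pandas as pd
    
    # Adjust the base path if the files were placed in Google Drive instead of uploaded
    BASE = "/content"
    
    gdp = pd.read_csv(f"{BASE}/GDPPC.csv")
    expl = pd.read_csv(f"{BASE}/explanatory.csv")
    culture = pd.read_csv(f"{BASE}/culture_exports.csv")
    polity = pd.read_csv(f"{BASE}/p5v2018.csv")
    
    # All files are described as harmonized to ISO3 codes on a common panel,
    # so successive merges on country and year should align them.
    # The column names "iso3" and "year" are assumptions -- rename as needed.
    panel = (
        gdp.merge(expl, on=["iso3", "year"], how="inner")
           .merge(culture, on=["iso3", "year"], how="inner")
           .merge(polity, on=["iso3", "year"], how="inner")
    )
    print(panel.shape)
    panel.head()
    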

  19. Can Developers Prompt? A Controlled Experiment for Code Documentation...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 11, 2024
    + more versions
    Cite
    Kruse, Hans-Alexander; Puhlfürß, Tim; Maalej, Walid (2024). Can Developers Prompt? A Controlled Experiment for Code Documentation Generation [Replication Package] [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13127237
    Explore at:
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Universität Hamburg
    Authors
    Kruse, Hans-Alexander; Puhlfürß, Tim; Maalej, Walid
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Summary of Artifacts

    This is the replication package for the paper titled 'Can Developers Prompt? A Controlled Experiment for Code Documentation Generation' that is part of the 40th IEEE International Conference on Software Maintenance and Evolution (ICSME), from October 6 to 11, 2024, located in Flagstaff, AZ, USA.

    Full Abstract

    Large language models (LLMs) bear great potential for automating tedious development tasks such as creating and maintaining code documentation. However, it is unclear to what extent developers can effectively prompt LLMs to create concise and useful documentation. We report on a controlled experiment with 20 professionals and 30 computer science students tasked with code documentation generation for two Python functions. The experimental group freely entered ad-hoc prompts in a ChatGPT-like extension of Visual Studio Code, while the control group executed a predefined few-shot prompt. Our results reveal that professionals and students were unaware of or unable to apply prompt engineering techniques. Especially students perceived the documentation produced from ad-hoc prompts as significantly less readable, less concise, and less helpful than documentation from prepared prompts. Some professionals produced higher quality documentation by just including the keyword Docstring in their ad-hoc prompts. While students desired more support in formulating prompts, professionals appreciated the flexibility of ad-hoc prompting. Participants in both groups rarely assessed the output as perfect. Instead, they understood the tools as support to iteratively refine the documentation. Further research is needed to understand which prompting skills and preferences developers have and which support they need for certain tasks.

    Author Information

    Name Affiliation Email

    Hans-Alexander Kruse Universität Hamburg hans-alexander.kruse@studium.uni-hamburg.de

    Tim Puhlfürß Universität Hamburg tim.puhlfuerss@uni-hamburg.de

    Walid Maalej Universität Hamburg walid.maalej@uni-hamburg.de

    Citation Information

    @inproceedings{kruse-icsme-2024, author={Kruse, Hans-Alexander and Puhlf{\"u}r{\ss}, Tim and Maalej, Walid}, booktitle={2024 IEEE International Conference on Software Maintenance and Evolution}, title={Can Developers Prompt? A Controlled Experiment for Code Documentation Generation}, year={2024}, doi={tba}, }

    Artifacts Overview

    1. Preprint

    The file kruse-icsme-2024-preprint.pdf is the preprint version of the official paper. You should read the paper in detail to understand the study, especially its methodology and results.

    2. Results

    The folder results includes two subfolders, explained in the following.

    Demographics RQ1 RQ2

    The subfolder Demographics RQ1 RQ2 provides the Jupyter Notebook file evaluation.ipynb for analyzing (1) the participants' responses to the digital survey and (2) the ad-hoc prompts that the experimental group entered into their tool. Hence, this file provides demographic information about the participants and the results for research questions 1 and 2. Please refer to the README file inside this subfolder for the installation steps of the Jupyter Notebook file.

    RQ2

    The subfolder RQ2 contains further subfolders with Microsoft Excel files specific to the results of research question 2:

    The subfolder UEQ contains three copies of the official User Experience Questionnaire (UEQ) analysis Excel tool, with data entered for all participants, for students only, and for professionals only.

    The subfolder Open Coding contains three Excel files with the open-coding results for the free-text answers that participants could enter at the end of the survey to state additional positive and negative comments about their experience during the experiment. The Consensus file provides the finalized version of the open coding process.

    3. Extension

    The folder extension contains the code of the Visual Studio Code (VS Code) extension developed in this study to generate code documentation with predefined prompts. Please refer to the README file inside the folder for installation steps. Alternatively, you can install the deployed version of this tool, called Code Docs AI, via the VS Code Marketplace.

    You can install the tool to generate code documentation with ad-hoc prompts directly via the VS Code Marketplace. We did not include the code of this extension in this replication package due to license conflicts (GNUv3 vs. MIT).

    4. Survey

    The folder survey contains PDFs of the digital survey in two versions:

    The file Survey.pdf contains the rendered version of the survey (how it was presented to participants).

    The file SurveyOptions.pdf is an export of the LimeSurvey web platform. Its main purpose is to provide the technical answer codes, e.g., AO01 and AO02, that refer to the rendered answer texts, e.g., Yes and No. This can help you if you want to analyze the CSV files inside the results folder (instead of using the Jupyter Notebook file), as the CSVs contain the answer codes, not the answer texts. Please note that an export issue caused page 9 to be almost blank. However, this problem is negligible as the question on this page only contained one free-text answer field.

    5. Appendix

    The folder appendix provides additional material about the study:

    The subfolder tool_screenshots contains screenshots of both tools.

    The file few_shots.txt lists the few shots used for the predefined prompt tool.

    The file test_functions.py lists the functions used in the experiment.

    Revisions

    Version Changelog

    1.0.0 Initial upload

    1.1.0 Add paper preprint. Update abstract.

    1.2.0 Update replication package based on ICSME Artifact Track reviews

    License

    See LICENSE file.

  20. NeoModeling Framework: Leveraging Graph-Based Persistence for Large-Scale...

    • zenodo.org
    zip
    Updated Sep 30, 2025
    Cite
    Luciano Marchezan; Luciano Marchezan; Nikitchyn Vitalii; Eugene Syriani; Eugene Syriani; Nikitchyn Vitalii (2025). NeoModeling Framework: Leveraging Graph-Based Persistence for Large-Scale Model-Driven Engineering (replication package) [Dataset]. http://doi.org/10.5281/zenodo.17238878
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Luciano Marchezan; Luciano Marchezan; Nikitchyn Vitalii; Eugene Syriani; Eugene Syriani; Nikitchyn Vitalii
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the replication package for the paper "NeoModeling Framework: Leveraging Graph-Based Persistence for Large-Scale Model-Driven Engineering" where we present Neo Modeling Framework (NMF), an open-source set of tools primarily designed to manipulate ultra-large datasets in the Neo4j database.

    Repository structure

    • NeoModelingFramework.zip - contains the replication package, including the source code for NMF, test files for running the evaluation, the artifacts used, and instructions for running the framework. The most important folders are listed below:
      • codeGenerator - NMF generator module
      • modelLoader - NMF loader module
      • modelEditor - NMF editor module
      • Evaluation - contains the evaluation artifacts and results (a copy
        • metamodels - Ecore files used for RQ1 and RQ2
        • results - CSV files with the results from RQ1, RQ2 and RQ3
        • analysis - Jupyter notebooks used to analyze and plot the results

    Running NMF

    The best way to run NMF is to follow the instructions in our GitHub repository. A copy of the README file is also included in the zip file available here.

    Empirical Evaluation

    Make sure that you follow the instructions to run NMF.

    The quantitative evaluation can be reproduced by running RQ1Eval.kt and RQ2Eval.kt inside modelLoader/src/test/kotlin/evaluation, and RQ2Eval.kt inside modelEditor/src/test/kotlin/evaluation.

    Make sure that you have an empty instance of Neo4j running.


    Results are generated as CSV files under Evaluation/results and can be plotted by running the Jupyter notebooks in Evaluation/analysis.

    Please note that due to differences in hardware, re-running the experiments will probably generate slightly different results than those reported in the paper.
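
    For a quick look outside the provided notebooks, the generated CSVs can also be loaded and plotted directly; the sketch below is illustrative only, and the file name and column names are assumptions to be adapted to the actual files under Evaluation/results.

    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Hypothetical results file and columns -- check Evaluation/results for the real names
    df = pd.read_csv("Evaluation/results/rq1_results.csv")
    
    # Example: mean loading time per model size
    df.groupby("model_size")["time_ms"].mean().plot(kind="bar")
    plt.ylabel("mean time (ms)")
    plt.title("RQ1: loading time by model size")
    plt.tight_layout()
    plt.show()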
