40 datasets found
  1. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    application/gzip
    Updated Mar 16, 2021
    + more versions
    Cite
João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks / Understanding and Improving the Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.3519618
Available download formats: application/gzip
    Dataset updated
    Mar 16, 2021
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
João Felipe; Leonardo; Vanessa; Juliana
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and produces results that can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Based on the results, we proposed and evaluated Julynter, a linting tool for Jupyter Notebooks.

    Papers:

    This repository contains three files:

    Reproducing the Notebook Study

    The db2020-09-22.dump.gz file contains a PostgreSQL dump of the database, with all the data we extracted from notebooks. For loading it, run:

    gunzip -c db2020-09-22.dump.gz | psql jupyter

Note that this file contains only the database with the extracted data. The actual repositories are available in a Google Drive folder, which also contains the Docker images we used in the reproducibility study. The repositories are stored as content/{hash_dir1}/{hash_dir2}.tar.bz2, where hash_dir1 and hash_dir2 are columns of the repositories table in the database.
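The archive layout described above can be sketched as a small helper; the column names hash_dir1 and hash_dir2 come from the description, while the function itself is a hypothetical illustration, not part of the dataset:

```python
# Hypothetical helper: build the archive path of a repository from its
# hash_dir1/hash_dir2 column values, following the
# content/{hash_dir1}/{hash_dir2}.tar.bz2 layout described above.
def archive_path(hash_dir1: str, hash_dir2: str) -> str:
    return f"content/{hash_dir1}/{hash_dir2}.tar.bz2"

print(archive_path("ab", "cdef0123"))  # content/ab/cdef0123.tar.bz2
```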

For scripts, notebooks, and detailed instructions on how to analyze or reproduce the data collection, please check the instructions in the Jupyter Archaeology repository (tag 1.0.0).

    The sample.tar.gz file contains the repositories obtained during the manual sampling.

    Reproducing the Julynter Experiment

The julynter_reproducibility.tar.gz file contains all the data collected in the Julynter experiment and the analysis notebooks. Reproducing the analysis is straightforward:

    • Uncompress the file: $ tar zxvf julynter_reproducibility.tar.gz
    • Install the dependencies: $ pip install -r julynter/requirements.txt
    • Run the notebooks in order: J1.Data.Collection.ipynb; J2.Recommendations.ipynb; J3.Usability.ipynb.

    The collected data is stored in the julynter/data folder.

    Changelog

    2019/01/14 - Version 1 - Initial version
    2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason
    2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider them, and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.
    2021/03/15 - Version 4 - Add Julynter experiment; Update database dump to include new data collected for the second paper; remove scripts and analysis notebooks from this package (moved to GitHub), add a link to Google Drive with collected repository files

  2. Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Richard Ferrers; Speedtest Global Index (2023). Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus ALC - 2020, 2022 [Dataset]. http://doi.org/10.6084/m9.figshare.13621169.v24
Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Speedtest Global Index
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset compares FIXED-line broadband internet speeds in four cities, plus Alice Springs:

    - Melbourne, AU
    - Bangkok, TH
    - Shanghai, CN
    - Los Angeles, US
    - Alice Springs, AU

ERRATA:

    1. Data is for Q3 2020, but some files are labelled incorrectly as 02-20 or June 20. They should all read Sept 20 (09-20, i.e. Q3 20) rather than Q2. Renamed and reloaded; amended in v7.

    2. LAX file named 0320, when it should be Q320. Amended in v8.

Lines of data for each geojson file (a line equates to a 600m^2 location, including total tests, devices used, and average upload and download speed):

    - MEL 16181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
    - SHG 31745 lines => 0.65M speedtests (2.5/100pp)
    - BKK 29296 lines => 1.5M speedtests (14.3/100pp)
    - LAX 15899 lines => 1.3M speedtests (10.4/100pp)
    - ALC 76 lines => 500 speedtests (2/100pp)

Geojsons of these 2-degree by 2-degree extracts for MEL, BKK, SHG now added; LAX added in v6, Alice Springs in v15.

    This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.

** To Do

    Will add Google Map versions so everyone can see without installing Jupyter.
    - Link to Google Map (BKK) added below. Key: Green > 100Mbps (Superfast); Black > 500Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook.
    - Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100Mbps. Suggest plotting the top 20% on a map for the community. Google Map link now added (and tweet).

** Python

    melb = au_tiles.cx[144:146, -39:-37]  # Lat/Lon extract
    shg = tiles.cx[120:122, 30:32]        # Lat/Lon extract
    bkk = tiles.cx[100:102, 13:15]        # Lat/Lon extract
    lax = tiles.cx[-118:-120, 33:35]      # Lat/Lon extract
    ALC = tiles.cx[132:134, -22:-24]      # Lat/Lon extract
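The .cx calls above are geopandas' coordinate-based indexer, which selects tiles inside a lat/lon bounding box. A minimal pandas-only sketch of the same idea, using hypothetical centroid columns lon/lat (not the dataset's actual schema):

```python
import pandas as pd

# Hypothetical tile table with centroid coordinates (the real data is GeoJSON)
tiles = pd.DataFrame({
    "lon": [144.5, 121.0, 101.3, -119.2, 133.0],
    "lat": [-38.2, 31.1, 14.0, 34.1, -23.0],
})

def bbox(df, lon_lo, lon_hi, lat_lo, lat_hi):
    """Keep rows whose centroid falls inside the bounding box."""
    return df[df["lon"].between(lon_lo, lon_hi) & df["lat"].between(lat_lo, lat_hi)]

melb = bbox(tiles, 144, 146, -39, -37)  # mirrors au_tiles.cx[144:146, -39:-37]
```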

Histograms (v9) and data visualisations (v3, 5, 9, 11) are provided. Data source: this is an extract of Speedtest Open Data available at Amazon AWS (link below: opendata.aws).

** VERSIONS

    v24. Add tweet and Google Map of top 20% (over 100Mbps locations) in Mel Q3 22. Add v1.5 MEL-Superfast notebook, and CSV of results (now on Google Map; link below).
    v23. Add graph of 2022 broadband distribution, and compare 2020 - 2022. Updated v1.4 Jupyter notebook.
    v22. Add import ipynb; workflow-import-4cities.
    v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020: 4.3M tests; 2022: 2.9M tests)

    - Melb: 14784 lines, avg download speed 69.4M, 0.39M tests
    - SHG: 31207 lines, avg 233.7M, 0.56M tests
    - ALC: 113 lines, avg 51.5M, 1092 tests
    - BKK: 29684 lines, avg 215.9M, 1.2M tests
    - LAX: 15505 lines, avg 218.5M, 0.74M tests

    v20. Speedtest - Five Cities inc ALC.
    v19. Add ALC2.ipynb.
    v18. Add ALC line graph.
    v17. Added ipynb for ALC. Added ALC to title.
    v16. Load Alice Springs data Q2 21 - csv. Added Google Map link of ALC.
    v15. Load Melb Q1 2021 data - csv.
    v14. Added Melb Q1 2021 data - geojson.
    v13. Added Twitter link to pics.
    v12. Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb).
    v11. Add Line-Compare pic, plotting four cities on a graph.
    v10. Add four histograms in one pic.
    v9. Add histogram for four cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook).
    v8. Renamed LAX file to Q3, rather than 03.
    v7. Amended file names of BKK files to correctly label as Q3, not Q2 or 06.
    v6. Added LAX file.
    v5. Add screenshot of BKK Google Map.
    v4. Add BKK Google Map (link below), and BKK csv mapping files.
    v3. Replaced MEL map with big-key version. Previous key was very tiny in top right corner.
    v2. Uploaded MEL, SHG, BKK data and Jupyter Notebook.
    v1. Metadata record.

** LICENCE: The AWS data licence on Speedtest data is CC BY-NC-SA 4.0, so use of this data must be: non-commercial (NC); reuse must be share-alike (SA), i.e. carry the same licence. This restricts the standard CC BY Figshare licence.

** Other uses of Speedtest Open Data: see link at Speedtest below.

  3. Galaxy Training Material for the 'Use Jupyter notebooks in Galaxy' tutorial

    • zenodo.org
    csv
    Updated Apr 22, 2025
    Cite
Delphine Lariviere; Teresa Müller (2025). Galaxy Training Material for the 'Use Jupyter notebooks in Galaxy' tutorial [Dataset]. http://doi.org/10.5281/zenodo.15263830
Available download formats: csv
    Dataset updated
    Apr 22, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Delphine Lariviere; Teresa Müller
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was originally curated by Software Carpentry, a branch of The Carpentries non-profit organization, and is based on data from the Gapminder Foundation. It consists of six tabular CSV files containing GDP data for various countries across different years. The dataset was initially prepared for the Software Carpentry tutorial "Plotting and Programming in Python" and is also reused in the Galaxy Training Network (GTN) tutorial "Use Jupyter Notebooks in Galaxy."

    This GTN tutorial provides an introduction to launching a Jupyter Notebook in Galaxy, installing dependencies, and importing and exporting data. It serves as a setup guide for a Jupyter Notebook environment that can be used to follow the Software Carpentry tutorial "Plotting and Programming in Python."

  4. Demographic Analysis Workflow using Census API in Jupyter Notebook:...

    • openicpsr.org
    delimited
    Updated Jul 23, 2020
    + more versions
    Cite
    Donghwan Gu; Nathanael Rosenheim (2020). Demographic Analysis Workflow using Census API in Jupyter Notebook: 1990-2000 Population Size and Change [Dataset]. http://doi.org/10.3886/E120381V1
Available download formats: delimited
    Dataset updated
    Jul 23, 2020
    Dataset provided by
    Texas A&M University
    Authors
    Donghwan Gu; Nathanael Rosenheim
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    US Counties, Kentucky, Boone County
    Description

This archive reproduces a table titled "Table 3.1 Boone county population size, 1990 and 2000" from Wang and vom Hofe (2007, p. 58). The archive provides a Jupyter Notebook that uses Python and can be run in Google Colaboratory. The workflow uses the Census API to retrieve data, reproduce the table, and ensure reproducibility for anyone accessing this archive.

    The Python code was developed in Google Colaboratory (Google Colab for short), an Integrated Development Environment (IDE) for JupyterLab that streamlines package installation, code collaboration, and management. The Census API is used to obtain population counts from the 1990 and 2000 Decennial Census (Summary File 1, 100% data). All downloaded data are maintained in the notebook's temporary working directory while in use. The data are also stored separately with this archive.

    The notebook features extensive explanations, comments, code snippets, and code output. The notebook can be viewed in PDF format or downloaded and opened in Google Colab. References to external resources are also provided for the various functional components. The notebook features code to perform the following functions:

    • install/import necessary Python packages
    • introduce a Census API query
    • download Census data via the Census API
    • manipulate Census tabular data
    • calculate absolute change and percent change
    • format numbers
    • export the table to CSV

    The notebook can be modified to perform the same operations for any county in the United States by changing the State and County FIPS code parameters for the Census API downloads. The notebook could be adapted for use in other environments (e.g., Jupyter Notebook), as well as for reading and writing files to a local or shared drive, or a cloud drive (e.g., Google Drive).
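The absolute-change and percent-change step can be sketched as follows; the function name and the counts are made up for illustration, not figures from the archive:

```python
def population_change(pop_start: int, pop_end: int) -> tuple:
    """Return absolute change and percent change between two census counts."""
    absolute = pop_end - pop_start
    percent = absolute / pop_start * 100
    return absolute, percent

# Made-up example counts, purely illustrative
absolute, percent = population_change(50_000, 57_500)
print(absolute, round(percent, 1))  # 7500 15.0
```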

  5. iris

    • huggingface.co
    Updated Jun 26, 2025
    + more versions
    Cite
    James (2025). iris [Dataset]. https://huggingface.co/datasets/butlerj/iris
    Dataset updated
    Jun 26, 2025
    Authors
    James
    Description

    Iris

The following code can be used to access the dataset from its stored location at NERSC. You may also access this code via a NERSC-hosted Jupyter notebook here.

        import pandas as pd

        dat = pd.read_csv('data/iris.csv')

  6. Articles metadata from CrossRef

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Kea Kohv (2025). Articles metadata from CrossRef [Dataset]. https://www.kaggle.com/datasets/keakohv/articles-doi-metadata
Available download formats: zip (72982417 bytes)
    Dataset updated
    Aug 1, 2025
    Authors
    Kea Kohv
    Description

This data originates from the Crossref API. It contains metadata on the articles in the Data Citation Corpus for citation pairs whose dataset identifier is a DOI.

    How to recreate this dataset in Jupyter Notebook:

1) Prepare the list of articles to query

    ```python
    import pandas as pd

    # See: https://www.kaggle.com/datasets/keakohv/data-citation-coprus-v4-1-eupmc-and-datacite
    CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"

    # Load the citation pairs from the Parquet file
    citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

    # Remove all rows where "https" is in the 'dataset' column but "doi.org" is not
    citation_pairs = citation_pairs[
        ~(citation_pairs['dataset'].str.contains("https")
          & ~citation_pairs['dataset'].str.contains("doi.org"))
    ]

    # Remove all rows where "figshare" is in the dataset name
    citation_pairs = citation_pairs[~citation_pairs['dataset'].str.contains("figshare")]

    # Keep only pairs whose dataset identifier is a DOI
    citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
    citation_pairs_doi = citation_pairs[citation_pairs['is_doi']].copy()

    # Deduplicate the article DOIs and restore '/' characters encoded as '_'
    articles = list(set(citation_pairs_doi['publication'].to_list()))
    articles = [doi.replace("_", "/") for doi in articles]

    # Save the list of articles to a file, one DOI per line
    with open("articles.txt", "w") as f:
        for article in articles:
            f.write(f"{article}\n")
    ```

    2) Query articles from CrossRef API

    
    %%writefile enrich.py
    #!pip install -q aiolimiter
    import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
    from aiolimiter import AsyncLimiter
    
    # ---------- config ----------
    HEADERS  = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"} # Put your email here
    MAX_RPS  = 45           # polite pool limit (50), leave head-room
    BATCH_SIZE = 10_000         # rows per INSERT
    DB_PATH  = pathlib.Path("crossref.sqlite").resolve()
    ARTICLES  = pathlib.Path("articles.txt")
    # -----------------------------
    
    # ---- platform tweak: prefer selector loop on Windows ----
    if sys.platform == "win32":
      asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    
    # ---- read the DOI list ----
    with ARTICLES.open(encoding="utf-8") as f:
      DOIS = [line.strip() for line in f if line.strip()]
    
    # ---- make sure DB & table exist BEFORE the async part ----
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(DB_PATH) as db:
      db.execute("""
        CREATE TABLE IF NOT EXISTS works (
          doi  TEXT PRIMARY KEY,
          json TEXT
        )
      """)
      db.execute("PRAGMA journal_mode=WAL;")   # better concurrency
    
    # ---------- async section ----------
    limiter = AsyncLimiter(MAX_RPS, 1)       # 45 req / second
    sem   = asyncio.Semaphore(100)        # cap overall concurrency
    
    async def fetch_one(session, doi: str):
      url = f"https://api.crossref.org/works/{doi}"
      async with limiter, sem:
        try:
          async with session.get(url, headers=HEADERS, timeout=10) as r:
            if r.status == 404:         # common “not found”
              return doi, None
            r.raise_for_status()        # propagate other 4xx/5xx
            return doi, await r.json()
        except Exception as e:
          return doi, None            # log later, don’t crash
    
    async def main():
      start = time.perf_counter()
      db  = sqlite3.connect(DB_PATH)        # KEEP ONE connection
      db.execute("PRAGMA synchronous = NORMAL;")   # speed tweak
    
      async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
        for chunk_start in range(0, len(DOIS), BATCH_SIZE):
          slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
          tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
          results = await asyncio.gather(*tasks)    # all tuples, no exc
    
          good_rows, bad_dois = [], []
          for doi, payload in results:
            if payload is None:
              bad_dois.append(doi)
            else:
              good_rows.append((doi, orjson.dumps(payload).decode()))
    
          if good_rows:
            db.executemany(
              "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
              good_rows,
            )
            db.commit()
    
      if bad_dois:                # append failures for later retry
        with open("failures.log", "a", encoding="utf-8") as fh:
          fh.writelines(f"{d}\n" for d in bad_dois)
    
          done = chunk_start + len(slice_)
          rate = done / (time.perf_counter() - start)
          print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")
    
      db.close()
    
    if __name__ == "__main__":
      asyncio.run(main())
    

Then run it from a notebook cell: !python enrich.py (or from a shell: python enrich.py)

    3) Finally extract the necessary fields

    import sqlite3
    import orjson
    i...
    
  7. Data from: Enhancing Carrier Mobility In Monolayer MoS2 Transistors With...

    • aws-databank-alb.library.illinois.edu
    • databank.illinois.edu
    Updated Mar 28, 2024
    Cite
    Yue Zhang; Helin Zhao; Siyuan Huang; Mohhamad Abir Hossain; Arend van der Zande (2024). Enhancing Carrier Mobility In Monolayer MoS2 Transistors With Process induced Strain [Dataset]. http://doi.org/10.13012/B2IDB-7519929_V1
    Dataset updated
    Mar 28, 2024
    Authors
    Yue Zhang; Helin Zhao; Siyuan Huang; Mohhamad Abir Hossain; Arend van der Zande
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Read me file for the data repository.

    This repository has raw data for the publication "Enhancing Carrier Mobility In Monolayer MoS2 Transistors With Process Induced Strain". We arrange the data following the figure in which it first appeared. For all electrical transfer measurements, we provide the up-sweep and down-sweep data, with voltage in units of V and conductance in units of S. All Raman modes have units of cm^-1.

    How to use this dataset: all data in this dataset is stored in binary NumPy array format as .npy files. To read a .npy file, use the NumPy module of the Python language with the np.load() command. For example, suppose the filename is example_data.npy. To load it into a Python program, open a Jupyter notebook (or in the Python program) and run:

        import numpy as np
        data = np.load("example_data.npy")

    The example file is then stored in the data object.
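A quick round-trip sketch of the .npy workflow described above; example_data.npy is the readme's example name, and the array values here are arbitrary:

```python
import numpy as np

# Save an arbitrary array, then load it back with np.load() as the readme describes
arr = np.array([1.0, 2.5, 3.0])
np.save("example_data.npy", arr)
data = np.load("example_data.npy")
print(np.array_equal(arr, data))  # True
```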

  8. Development and Implementation of Database and Analyses for High Frequency...

    • hydroshare.org
    • beta.hydroshare.org
    • +1more
    zip
    Updated Jun 11, 2019
    Cite
    Hyrum Tennant; Amber Spackman Jones (2019). Development and Implementation of Database and Analyses for High Frequency Data [Dataset]. https://www.hydroshare.org/resource/bf57045c30054383a6df9bb8cab381d3
Available download formats: zip (712.9 MB)
    Dataset updated
    Jun 11, 2019
    Dataset provided by
    HydroShare
    Authors
    Hyrum Tennant; Amber Spackman Jones
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2014 - May 22, 2018
    Description

    For environmental data measured by a variety of sensors and compiled from various sources, practitioners need tools that facilitate data access and data analysis. Data are often organized in formats that are incompatible with each other and that prevent full data integration. Furthermore, analyses of these data are hampered by the inadequate mechanisms for storage and organization. Ideally, data should be centrally housed and organized in an intuitive structure with established patterns for analyses. However, in reality, the data are often scattered in multiple files without uniform structure that must be transferred between users and called individually and manually for each analysis. This effort describes a process for compiling environmental data into a single, central database that can be accessed for analyses. We use the Logan River watershed and observed water level, discharge, specific conductance, and temperature as a test case. Of interest is analysis of flow partitioning. We formatted data files and organized them into a hierarchy, and we developed scripts that import the data to a database with structure designed for hydrologic time series data. Scripts access the populated database to determine baseflow separation, flow balance, and mass balance and visualize the results. The analyses were compiled into a package of scripts in Python, which can be modified and run by scientists and researchers to determine gains and losses in reaches of interest. To facilitate reproducibility, the database and associated scripts were shared to HydroShare as Jupyter Notebooks so that any user can access the data and perform the analyses, which facilitates standardization of these operations.
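As one example of what such an analysis involves, a common single-parameter digital filter for baseflow separation (Lyne-Hollick) can be sketched as below. This is a generic textbook method, not necessarily the one implemented in the shared notebooks, and the discharge values are illustrative:

```python
def baseflow_separation(flow, alpha=0.925):
    """Lyne-Hollick one-parameter digital filter: split a streamflow series
    into quickflow and baseflow components. `flow` is a list of discharge values."""
    quick = [0.0]
    for i in range(1, len(flow)):
        f = alpha * quick[-1] + (1 + alpha) / 2 * (flow[i] - flow[i - 1])
        # Quickflow is constrained to lie between zero and the total flow
        quick.append(min(max(f, 0.0), flow[i]))
    base = [q_total - q_fast for q_total, q_fast in zip(flow, quick)]
    return base, quick

# Illustrative discharge series (arbitrary units)
base, quick = baseflow_separation([5.0, 20.0, 12.0, 8.0, 6.0])
```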

  9. National Water Model HydroLearn Python Notebooks

    • search.dataone.org
    Updated Apr 15, 2022
    + more versions
    Cite
    Justin Hunter (2022). National Water Model HydroLearn Python Notebooks [Dataset]. https://search.dataone.org/view/sha256%3A671542b16aaceb670d8f579224bba83cbcfa0cd9a6a0ffabe05e3767164448d7
    Dataset updated
    Apr 15, 2022
    Dataset provided by
    Hydroshare
    Authors
    Justin Hunter
    Description

This resource contains Jupyter Python notebooks which are intended to be used to learn about the U.S. National Water Model (NWM). These notebooks explore NWM forecasts in various ways. Notebooks 1, 2, and 3 access NWM forecasts directly from the NOAA NOMADS file sharing system. Notebook 4 accesses NWM forecasts from Google Cloud Platform (GCP) storage in addition to NOMADS. A brief summary of what each notebook does is included below:

    Notebook 1 (NWM1_Visualization) focuses on visualization. It includes functions for downloading and extracting time series forecasts for any of the 2.7 million stream reaches of the U.S. NWM. It also demonstrates ways to visualize forecasts using Python packages like matplotlib.

Notebook 2 (NWM2_Xarray) explores methods for slicing and dicing NWM NetCDF files using the Python library xarray.

    Notebook 3 (NWM3_Subsetting) is focused on subsetting NWM forecasts and NetCDF files for specified reaches and exporting NWM forecast data to CSV files.

    Notebook 4 (NWM4_Hydrotools) uses Hydrotools, a new suite of tools for evaluating NWM data, to retrieve NWM forecasts both from NOMADS and from Google Cloud Platform storage where older NWM forecasts are cached. This notebook also briefly covers visualizing, subsetting, and exporting forecasts retrieved with Hydrotools.

    The notebooks are part of a NWM learning module on HydroLearn.org. When the associated learning module is complete, the link to it will be added here. It is recommended that these notebooks be opened through the CUAHSI JupyterHub App on Hydroshare. This can be done via the 'Open With' button at the top of this resource page.

  10. OpenOrca

    • kaggle.com
    • opendatalab.com
    • +1more
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). OpenOrca [Dataset]. https://www.kaggle.com/datasets/thedevastator/open-orca-augmented-flan-dataset/versions/2
Available download formats: zip (2548102631 bytes)
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Open-Orca Augmented FLAN Dataset

    Unlocking Advanced Language Understanding and ML Model Performance

    By Huggingface Hub [source]

    About this dataset

    The Open-Orca Augmented FLAN Collection is a revolutionary dataset that unlocks new levels of language understanding and machine learning model performance. This dataset was created to support research on natural language processing, machine learning models, and language understanding through leveraging the power of reasoning trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences in a more robust way than ever before, this dataset provides researchers expanded opportunities for furthering the progress of linguistics research. With its unique combination of features including system prompts, questions from users and responses from systems, this dataset opens up exciting possibilities for deeper exploration into the cutting edge concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance - explore Open-Orca Augmented FLAN Collection today!


    How to use the dataset

    This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.

Getting Started: First, download the data set from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or in cloud storage. Once you have downloaded the data set, launch the Jupyter Notebook or Google Colab program you want to work with.

Exploring & Preprocessing Data: To get a better understanding of the features in this dataset, import them into a Pandas DataFrame as shown below. You can use other libraries as needed:

    import pandas as pd   # Library used for importing datasets into Python 
    
    df = pd.read_csv('train.csv')  # Import train.csv into a Pandas DataFrame
    
    df[['system_prompt','question','response']].head() #Views top 5 rows with columns 'system_prompt','question','response'
    

After importing, check each feature with basic descriptive statistics, such as a Pandas value_counts or groupby, for greater clarity over the variables present in each feature. The command below shows the count of each element in the system_prompt column of train.csv:

     df['system_prompt'].value_counts().head()  # count of each element in 'system_prompt'
    

Data Transformation: After inspecting and exploring the features, you may need transformations that best suit your needs before training models on this dataset. A common step is removing punctuation marks: since punctuation may not add value to computation, it can be removed with a regex replacement such as .str.replace('[^A-Za-z ]+', '', regex=True).
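For instance, the punctuation-stripping step might look like this (a sketch using Python's re module; the function name and exact pattern are illustrative choices):

```python
import re

def strip_punctuation(text: str) -> str:
    """Remove everything except letters and spaces, as described above."""
    return re.sub(r"[^A-Za-z ]+", "", text)

print(strip_punctuation("Hello, world!"))  # Hello world
```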

    Research Ideas

    • Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
    • Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
    • Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...

  11. Data from: Long-Term Tracing of Indoor Solar Harvesting

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Jul 22, 2024
    Cite
    Sigrist, Lukas; Gomez, Andres; Thiele, Lothar (2024). Long-Term Tracing of Indoor Solar Harvesting [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3346975
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    ETH Zurich
    Authors
    Sigrist, Lukas; Gomez, Andres; Thiele, Lothar
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Information

This dataset presents long-term indoor solar harvesting traces, jointly monitored with the ambient conditions. The data is recorded at 6 indoor positions with diverse characteristics at our institute at ETH Zurich in Zurich, Switzerland.

    The data is collected with a measurement platform [3] consisting of a solar panel (AM-5412) connected to a bq25505 energy harvesting chip that stores the harvested energy in a virtual battery circuit. Two TSL45315 light sensors placed on opposite sides of the solar panel monitor the illuminance level and a BME280 sensor logs ambient conditions like temperature, humidity and air pressure.

    The dataset contains the measurement of the energy flow at the input and the output of the bq25505 harvesting circuit, as well as the illuminance, temperature, humidity and air pressure measurements of the ambient sensors. The following timestamped data columns are available in the raw measurement format, as well as preprocessed and filtered HDF5 datasets:

    V_in - Converter input/solar panel output voltage, in volt

    I_in - Converter input/solar panel output current, in ampere

    V_bat - Battery voltage (emulated through circuit), in volt

    I_bat - Net battery current (in-/out-flowing), in ampere

    Ev_left - Illuminance left of solar panel, in lux

    Ev_right - Illuminance right of solar panel, in lux

    P_amb - Ambient air pressure, in pascal

    RH_amb - Ambient relative humidity, unit-less between 0 and 1

    T_amb - Ambient temperature, in centigrade Celsius
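    As a worked example of how these columns combine, the harvested input power is simply the product of V_in and I_in; the sketch below uses made-up sample values and pandas, and is not code shipped with the dataset:

```python
import pandas as pd

# Made-up sample values standing in for one processed measurement file;
# the column names follow the dataset description above.
df = pd.DataFrame({
    "V_in": [2.0, 2.1, 1.9],          # converter input voltage, volt
    "I_in": [0.001, 0.0012, 0.0009],  # converter input current, ampere
})

# Instantaneous power harvested by the solar panel, in watt;
# integrating P_in over time yields the harvested energy.
df["P_in"] = df["V_in"] * df["I_in"]
print(df["P_in"].round(5).tolist())
```

    The same pattern applies to the battery side (V_bat times I_bat) for the energy flow at the converter output.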

    The following publication presents an overview of the dataset and more details on the deployment used for data collection. A copy of the abstract is included in this dataset; see the file abstract.pdf.

    L. Sigrist, A. Gomez, and L. Thiele. "Dataset: Tracing Indoor Solar Harvesting." In Proceedings of the 2nd Workshop on Data Acquisition To Analysis (DATA '19), 2019.

    Folder Structure and Files

    processed/ - This folder holds the imported, merged and filtered datasets of the power and sensor measurements. The datasets are stored in HDF5 format and split by measurement position posXX and by power and ambient sensor measurements. The files belonging to this folder are contained in archives named yyyy_mm_processed.tar, where yyyy and mm represent the year and month the data was published. A separate file lists the exact content of each archive (see below).

    raw/ - This folder holds the raw measurement files recorded with the RocketLogger [1, 2] using the measurement platform available at [3]. The files belonging to this folder are contained in archives named yyyy_mm_raw.tar, where yyyy and mm represent the year and month the data was published. A separate file lists the exact content of each archive (see below).

    LICENSE - License information for the dataset.

    README.md - The README file containing this information.

    abstract.pdf - A copy of the above-mentioned abstract submitted to the DATA '19 Workshop, introducing this dataset and the deployment used to collect it.

    raw_import.ipynb [open in nbviewer] - Jupyter Python notebook to import, merge, and filter the raw dataset from the raw/ folder. This is the exact code used to generate the processed dataset and store it in the HDF5 format in the processed/ folder.

    raw_preview.ipynb [open in nbviewer] - This Jupyter Python notebook imports the raw dataset directly and plots a preview of the full power trace for all measurement positions.

    processing_python.ipynb [open in nbviewer] - Jupyter Python notebook demonstrating the import and use of the processed dataset in Python. Calculates column-wise statistics, includes more detailed power plots and the simple energy predictor performance comparison included in the abstract.

    processing_r.ipynb [open in nbviewer] - Jupyter R notebook demonstrating the import and use of the processed dataset in R. Calculates column-wise statistics and extracts and plots the energy harvesting conversion efficiency included in the abstract. Furthermore, the harvested power is analyzed as a function of the ambient light level.

    Dataset File Lists

    Processed Dataset Files

    The list of the processed datasets included in the yyyy_mm_processed.tar archive is provided in yyyy_mm_processed.files.md. The markdown formatted table lists the name of all files, their size in bytes, as well as the SHA-256 sums.

    Raw Dataset Files

    A list of the raw measurement files included in the yyyy_mm_raw.tar archive(s) is provided in yyyy_mm_raw.files.md. The markdown formatted table lists the name of all files, their size in bytes, as well as the SHA-256 sums.
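    The SHA-256 sums in these file lists can be checked against a downloaded archive member with a few lines of Python; this is a generic integrity-check sketch (the file name in the comment is illustrative), not code from the dataset:

```python
import hashlib

def sha256sum(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the digest listed in yyyy_mm_processed.files.md:
# sha256sum("processed/pos01.h5") == "<digest from the file list>"
```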

    Dataset Revisions

    v1.0 (2019-08-03)

    Initial release. Includes the data collected from 2017-07-27 to 2019-08-01. The dataset archive files related to this revision are 2019_08_raw.tar and 2019_08_processed.tar. For position pos06, the measurements from 2018-01-06 00:00:00 to 2018-01-10 00:00:00 are filtered (data inconsistency in file indoor1_p27.rld).

    v1.1 (2019-09-09)

    Revision of the processed dataset v1.0 and addition of the final dataset abstract. Updated processing scripts reduce the timestamp drift in the processed dataset, the archive 2019_08_processed.tar has been replaced. For position pos06, the measurements from 2018-01-06 16:00:00 to 2018-01-10 00:00:00 are filtered (indoor1_p27.rld data inconsistency).

    v2.0 (2020-03-20)

    Addition of new data. Includes the raw data collected from 2019-08-01 to 2020-03-16. The processed data is updated with full coverage from 2017-07-27 to 2020-03-16. The dataset archive files related to this revision are 2020_03_raw.tar and 2020_03_processed.tar.

    Dataset Authors, Copyright and License

    Authors: Lukas Sigrist, Andres Gomez, and Lothar Thiele

    Contact: Lukas Sigrist (lukas.sigrist@tik.ee.ethz.ch)

    Copyright: (c) 2017-2019, ETH Zurich, Computer Engineering Group

    License: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

    References

    [1] L. Sigrist, A. Gomez, R. Lim, S. Lippuner, M. Leubin, and L. Thiele. Measurement and validation of energy harvesting IoT devices. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

    [2] ETH Zurich, Computer Engineering Group. RocketLogger Project Website, https://rocketlogger.ethz.ch/.

    [3] L. Sigrist. Solar Harvesting and Ambient Tracing Platform, 2019. https://gitlab.ethz.ch/tec/public/employees/sigristl/harvesting_tracing

  12. UWB Motion Detection Data Set

    • data.niaid.nih.gov
    Updated Feb 11, 2022
    Cite
    Klemen Bregar; Andrej Hrovat; Mihael Mohorčič (2022). UWB Motion Detection Data Set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4613124
    Explore at:
    Dataset updated
    Feb 11, 2022
    Dataset provided by
    Institut Jožef Stefan
    Authors
    Klemen Bregar; Andrej Hrovat; Mihael Mohorčič
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This data set includes a collection of measurements using DecaWave DW1000 UWB radios in two indoor environments, used for motion detection. Measurements include channel impulse response (CIR) samples in the form of power delay profiles (PDP), with corresponding timestamps, for three channels in each indoor environment.

    The data set includes Python code and Jupyter notebooks for data loading and analysis, and for reproducing the results of the paper entitled "UWB Radio Based Motion Detection System for Assisted Living" submitted to MDPI Sensors.

    The data set will require around 10 GB of total free space after extraction.

    The code included in the data set was written and tested on Linux (Ubuntu 20.04) and requires 16 GB of RAM plus an additional swap partition to run properly. The code could be modified to consume less memory, but that would require additional work. If the .npy format is compatible with your numpy version, you won't need to regenerate the .npy data from the .csv files.
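    One quick way to test that compatibility before regenerating anything is a .npy round-trip with the locally installed numpy; this is a generic sketch, not part of the data set (to check the data set's own files, call np.load on one of them directly):

```python
import os
import tempfile

import numpy as np

# Save and reload a small array; if this round-trip works, the installed
# numpy can read and write the .npy format used by the intermediate files.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
path = os.path.join(tempfile.mkdtemp(), "check.npy")
np.save(path, arr)
loaded = np.load(path)
print(np.array_equal(arr, loaded))
```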

    Data Set Structure

    The resulting folder after extracting the uwb_motion_detection.zip file is organized as follows:

    data subfolder: contains all original .csv and intermediate .npy data files.

    models

    pdp: this folder contains 4 .csv files with raw PDP measurements (timestamp + PDP). The data format will be discussed in the following section.

    pdp_diff: this folder contains .npy files with PDP samples and .npy files with timestamps. Those files are generated by running the generate_pdp_diff.py script.

    generate_pdp_diff.py

    validation subfolder: contains data for motion detection validation

    events: contains .npy files with motion events for validation. The .npy files are generated using generate_event_x.py files or notebooks inside the /Process/validation folder.

    pdp: this folder contains raw PDP measurements in .csv format.

    pdp_diff: this folder contains .npy files with PDP samples and .npy files with timestamps. Those files are generated by running the generate_pdp_diff.py script.

    generate_events_0.py

    generate_events_1.py

    generate_events_2.py

    generate_pdp_diff.py

    figures subfolder: contains all figures generated in Jupyter notebooks inside the "Process" folder.

    Process subfolder: contains Jupyter notebooks with data processing and motion detection code.

    MotionDetection: contains notebook comparing standard score motion detection with windowed standard score motion detection

    OnlineModels: presents the development process of online models definitions

    PDP_diff: presents the basic principle of PDP differences used in the motion detection

    Validation: presents a motion detection validation process

    Raw data structure

    All .csv files in the data folder contain raw PDP measurements with timestamps for each PDP sample. The structure of each file is as follows:

    unix timestamp, cir0 [dBm], cir1 [dBm], cir2 [dBm], ..., cir149 [dBm]
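    A row in this format splits naturally into a timestamp and a 150-tap PDP vector; the sketch below parses a synthetic row and is not code from the data set:

```python
import numpy as np

# Synthetic row: a unix timestamp followed by 150 CIR taps in dBm.
row = "1616000000.0," + ",".join(f"{-80.0 - 0.1 * i:.1f}" for i in range(150))

# First field is the timestamp, the remaining 150 fields are the PDP taps.
values = np.array(row.split(","), dtype=float)
timestamp, pdp = values[0], values[1:]
print(int(timestamp), pdp.shape)
```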

  13. Data for: Emergent ferromagnetism near three-quarters filling in twisted...

    • purl.stanford.edu
    Updated Sep 2, 2022
    Cite
    Aaron Sharpe; Eli Fox; Arthur Barnard; Joe Finney; Kenji Watanabe; Takashi Taniguchi; Marc Kastner; David Goldhaber-Gordon (2022). Data for: Emergent ferromagnetism near three-quarters filling in twisted bilayer graphene [Dataset]. http://doi.org/10.25740/bg095cp1548
    Explore at:
    Dataset updated
    Sep 2, 2022
    Authors
    Aaron Sharpe; Eli Fox; Arthur Barnard; Joe Finney; Kenji Watanabe; Takashi Taniguchi; Marc Kastner; David Goldhaber-Gordon
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This archive contains the data and Python code generating figures for the article "Emergent ferromagnetism near three-quarters filling in twisted bilayer graphene" by Aaron L. Sharpe, Eli J. Fox, Arthur W. Barnard, Joe Finney, Kenji Watanabe, Takashi Taniguchi, M. A. Kastner, and David Goldhaber-Gordon and available at https://arxiv.org/abs/1901.03520. This archive contains the following: 1) TBG_ferromagnetism_figures.ipynb, a Jupyter notebook loading data and generating figures. The notebook has been tested with Python version 3.6.7 and Jupyter notebook server version 5.5.0. 2) HTML_notebook directory that contains 'TBG_ferromagnetism_figures.html' an HTML file generated from the Jupyter notebook, and PNG files loaded by the HTML file, 3) scripts directory that contains additional files used by the Jupyter notebook, and 4) data directory, containing all data used to generate figures for the manuscript, stored as JSON objects. Refer to the notebook for figure captions describing the data.

  14. Data from: A hydroclimatological approach to predicting regional landslide...

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    Cite
    Ronda Strauch; Erkan Istanbulluoglu; Sai Siddhartha Nudurupati; Christina Bandaragoda; Nicole Gasparini; Greg Tucker (2021). A hydroclimatological approach to predicting regional landslide probability using Landlab [Dataset]. http://doi.org/10.4211/hs.27d34fc967be4ee6bc1f1ae92657bf2b
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    Ronda Strauch; Erkan Istanbulluoglu; Sai Siddhartha Nudurupati; Christina Bandaragoda; Nicole Gasparini; Greg Tucker
    Area covered
    Description

    This resource supports the work published in Strauch et al., (2018) "A hydroclimatological approach to predicting regional landslide probability using Landlab", Earth Surf. Dynam., 6, 1-26 . It demonstrates a hydroclimatological approach to modeling of regional shallow landslide initiation based on the infinite slope stability model coupled with a steady-state subsurface flow representation. The model component is available as the LandslideProbability component in Landlab, an open-source, Python-based landscape earth systems modeling environment described in Hobley et al. (2017, Earth Surf. Dynam., 5, 21–46, https://doi.org/10.5194/esurf-5-21-2017). The model operates on a digital elevation model (DEM) grid to which local field parameters, such as cohesion and soil depth, are attached. A Monte Carlo approach is used to account for parameter uncertainty and calculate probability of shallow landsliding as well as the probability of soil saturation based on annual maximum recharge. The model is demonstrated in a steep mountainous region in northern Washington, U.S.A., using 30-m grid resolution over 2,700 km2.

    This resource contains a 1) User Manual that describes the Landlab LandslideProbability Component design, parameters, and step-by-step guidance on using the component in a model, and 2) two Landlab driver codes (notebooks) and customized component code to run Landlab's LandslideProbability component for 2a) synthetic recharge and 2b) modeled recharge published in Strauch et al., (2018). The Jupyter Notebooks use HydroShare code libraries to import data located at this resource: https://www.hydroshare.org/resource/a5b52c0e1493401a815f4e77b09d352b/.

    The Synthetic Recharge Jupyter Notebook

    The Modeled Recharge Jupyter Notebook
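    The Monte Carlo probability calculation described above can be sketched independently of the Landlab API; this is a minimal illustration of the textbook infinite-slope stability model with slope-parallel seepage, and all parameter distributions below are hypothetical:

```python
import numpy as np

# Infinite-slope factor of safety with slope-parallel seepage:
#   FS = C / (gamma * H * sin(a) * cos(a))
#        + (1 - w * gamma_w / gamma) * tan(phi) / tan(a)
# where w is the relative wetness of the soil column (recharge-driven).
rng = np.random.default_rng(42)
n = 10_000
C = rng.uniform(2e3, 8e3, n)                  # soil cohesion, Pa (hypothetical)
phi = np.radians(rng.uniform(28.0, 38.0, n))  # internal friction angle
w = rng.uniform(0.0, 1.0, n)                  # relative wetness from annual max recharge
H = 1.5                                       # soil depth, m
a = np.radians(35.0)                          # slope angle
gamma, gamma_w = 17e3, 9.81e3                 # soil / water unit weight, N/m^3

fs = C / (gamma * H * np.sin(a) * np.cos(a)) \
     + (1.0 - w * gamma_w / gamma) * np.tan(phi) / np.tan(a)
p_failure = float(np.mean(fs < 1.0))          # probability of shallow landsliding
print(round(p_failure, 3))
```

    The Landlab component applies the same idea per grid cell, drawing the uncertain parameters from distributions attached to the DEM fields.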

  15. NFDITalk (15 July 2024): Jupyter4NFDI - a central Jupyter Hub for the NFDI

    • meta4ds.fokus.fraunhofer.de
    html
    Updated Jul 15, 2024
    Cite
    NFDI (2024). NFDITalk (15 July 2024): Jupyter4NFDI - a central Jupyter Hub for the NFDI [Dataset]. https://meta4ds.fokus.fraunhofer.de/datasets/1pgpjyjs8tq?locale=en
    Explore at:
    htmlAvailable download formats
    Dataset updated
    Jul 15, 2024
    Dataset authored and provided by
    NFDI
    Description

    In our NFDITalks, scientists from different disciplines present exciting topics around NFDI and research data management. In this episode, Björn Hagemeier will talk about "Jupyter4NFDI - a central Jupyter Hub for the NFDI".

    Jupyter Notebooks are widespread across scientific disciplines today. However, their deployment across various NFDI consortia currently occurs through individual JupyterHubs, resulting in access barriers to computational and data resources. Whereas some of these services are widely available, others are barricaded within VPNs or otherwise inaccessible to a wider audience. Our ambition is to improve the user experience by offering a centralized service that extends the reach of Jupyter to a broader audience within the NFDI and beyond. The technical foundation for our service will be the versatile configuration frontend that has been proven to meet user needs over the past seven years at JSC. It is continuously extended and traces an ever-growing set of backend resources, ranging from cloud-based, small-scale JupyterLabs to full-scale remote desktop environments on high-performance computing systems such as Germany's highest-ranked TOP500 system, JUWELS Booster.

    Importantly, the centralized system will not only simplify access but also support the import of projects along with their necessary dependencies, fostering an ecosystem conducive to creating reproducible FAIR Digital Objects (FDOs), possibly along with notebook identifiers supported by PID4NFDI.

    In this talk, we'll revisit the history of the current solution, the landscape in which we intend to make it available, and give an outlook on the future of the service.

  16. Gaia EDR3 Catalogs of Machine-Learned Radial Velocities

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jun 10, 2022
    Cite
    Dropulic, Adriana; Liu, Hongwan; Ostdiek, Bryan; Lisanti, Mariangela (2022). Gaia EDR3 Catalogs of Machine-Learned Radial Velocities [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6558082
    Explore at:
    Dataset updated
    Jun 10, 2022
    Dataset provided by
    Princeton University
    Harvard University & The NSF AI Institute for Artificial Intelligence and Fundamental Interactions
    Princeton University & Center for Computational Astrophysics, Flatiron Institute
    New York University & Princeton University
    Authors
    Dropulic, Adriana; Liu, Hongwan; Ostdiek, Bryan; Lisanti, Mariangela
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Gaia EDR3 Catalogs of Machine-Learned Radial Velocities

    Spatially complete Test-Set and Machine-Learned Radial Velocity (ML-RV) Catalogs described in Dropulic et al., arXiv:2205.12278. The spatially complete Test-Set Catalog contains a total of 4,332,657 stars, while the spatially complete ML-RV Catalog contains 91,840,346 stars. We provide Gaia EDR3 Source IDs, the network-predicted line-of-sight velocity in km/s, and the network-predicted uncertainty in km/s.

    We have included a simple Jupyter notebook demonstrating how to import the data and make a simple histogram with it.
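    In the same spirit as that notebook, a histogram of predicted line-of-sight velocities can be binned with numpy alone; the values below are synthetic stand-ins, since the actual catalog columns are the Gaia Source IDs, the predicted velocity, and its uncertainty:

```python
import numpy as np

# Synthetic stand-in for network-predicted line-of-sight velocities, in km/s.
rng = np.random.default_rng(0)
v_los = rng.normal(loc=0.0, scale=30.0, size=100_000)

# 50 bins over +/- 150 km/s; plot counts vs. bin centers with any plotting library.
counts, edges = np.histogram(v_los, bins=50, range=(-150.0, 150.0))
centers = 0.5 * (edges[:-1] + edges[1:])
print(counts.sum(), centers[0], centers[-1])
```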

    If you find this catalog useful in your work, please cite Dropulic et al. arXiv:2205.12278, as well as Dropulic et al. ApJL 915, L14 (2021) arXiv:2103.14039.

  17. Hydroinformatics Instruction Module Example Code: Programmatic Data Access...

    • search.dataone.org
    • hydroshare.org
    • +1more
    Updated Dec 30, 2023
    Cite
    Amber Spackman Jones; Jeffery S. Horsburgh (2023). Hydroinformatics Instruction Module Example Code: Programmatic Data Access with USGS Data Retrieval [Dataset]. https://search.dataone.org/view/sha256%3A3b301506bb2be439d8f330b89de9c36ab074976044b6e5905593f2b7f5be772e
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Amber Spackman Jones; Jeffery S. Horsburgh
    Description

    This resource contains Jupyter Notebooks with examples for accessing USGS NWIS data via web services and performing subsequent analysis related to drought with particular focus on sites in Utah and the southwestern United States (could be modified to any USGS sites). The code uses the Python DataRetrieval package. The resource is part of set of materials for hydroinformatics and water data science instruction. Complete learning module materials are found in HydroLearn: Jones, A.S., Horsburgh, J.S., Bastidas Pacheco, C.J. (2022). Hydroinformatics and Water Data Science. HydroLearn. https://edx.hydrolearn.org/courses/course-v1:USU+CEE6110+2022/about.

    This resource consists of six example notebooks:

    1. Example 1: Import and plot daily flow data
    2. Example 2: Import and plot instantaneous flow data for multiple sites
    3. Example 3: Perform analyses with USGS annual statistics data
    4. Example 4: Retrieve data and find daily flow percentiles
    5. Example 5: Further examination of drought year flows
    6. Coding challenge: Assess drought severity
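    The percentile logic behind Example 4 can be sketched without any web-service access; the helper below is a generic illustration, not the USGS DataRetrieval API:

```python
import numpy as np

def flow_percentile(historical, today):
    """Percent of historical daily flows at or below today's flow."""
    historical = np.asarray(historical, dtype=float)
    return 100.0 * float(np.mean(historical <= today))

# A flow of 120 cfs against five historical daily values for the same calendar day:
print(flow_percentile([100, 150, 200, 80, 120], 120))
```

    Low percentiles over many consecutive days are one simple way to flag drought conditions at a site.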

  18. MCCN Case Study 3 - Select optimal survey locality

    • researchdata.edu.au
    Updated Nov 13, 2025
    Cite
    Rakesh David; Lili Andres Hernandez; Hoang Son Le; Donald Hobern; Alisha Aneja (2025). MCCN Case Study 3 - Select optimal survey locality [Dataset]. http://doi.org/10.25909/29176451.V1
    Explore at:
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    The University of Adelaide
    Authors
    Rakesh David; Lili Andres Hernandez; Hoang Son Le; Donald Hobern; Alisha Aneja
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MCCN project delivers tools to assist the agricultural sector in understanding crop-environment relationships, specifically by facilitating the generation of data cubes for spatiotemporal data. This repository contains Jupyter notebooks that demonstrate the functionality of the MCCN data cube components.

    The dataset contains input files for the case study (source_data), RO-Crate metadata (ro-crate-metadata.json), results from the case study (results), and the Jupyter Notebook (MCCN-CASE 3.ipynb).

    Research Activity Identifier (RAiD)

    RAiD: https://doi.org/10.26292/8679d473

    Case Studies

    This repository contains code and sample data for the following case studies. Note that the analyses here demonstrate the software; the results should not be considered scientifically or statistically meaningful. No effort has been made to address bias in samples, and sample data may not be available at sufficient density to warrant analysis. All case studies end with the generation of an RO-Crate data package including the source data, the notebook, and generated outputs, including NetCDF exports of the data cubes themselves.

    Case Study 3 - Select optimal survey locality

    Given a set of existing survey locations across a variable landscape, determine the optimal site to add to increase the range of surveyed environments. This study demonstrates: 1) Loading heterogeneous data sources into a cube, and 2) Analysis and visualisation using numpy and matplotlib.

    Data Sources

    The primary goal for this case study is to demonstrate being able to import a set of environmental values for different sites and then use these to identify a subset that maximises spread across the various environmental dimensions.

    This is a simple implementation that uses four environmental attributes imported for all Australia (or a subset like NSW) at a moderate grid scale:

    1. Digital soil maps for key soil properties over New South Wales, version 2.0 - SEED - see https://esoil.io/TERNLandscapes/Public/Pages/SLGA/ProductDetails-SoilAttributes.html
    2. ANUCLIM Annual Mean Rainfall raster layer - SEED - see https://datasets.seed.nsw.gov.au/dataset/anuclim-annual-mean-rainfall-raster-layer
    3. ANUCLIM Annual Mean Temperature raster layer - SEED - see https://datasets.seed.nsw.gov.au/dataset/anuclim-annual-mean-temperature-raster-layer

    Dependencies

    • This notebook requires Python 3.10 or higher
    • Install relevant Python libraries with: pip install mccn-engine rocrate
    • Installing mccn-engine will install other dependencies

    Overview

    1. Generate STAC metadata for layers from a predefined configuration
    2. Load data cube and exclude nodata values
    3. Scale all variables to a 0.0-1.0 range
    4. Select four layers for comparison (soil organic carbon 0-30 cm, soil pH 0-30 cm, mean annual rainfall, mean annual temperature)
    5. Select 10 random points within NSW
    6. Generate 10 new layers representing standardised environmental distance between one of the selected points and all other points in NSW
    7. For every point in NSW, find the lowest environmental distance to any of the selected points
    8. Select the point in NSW that has the highest value for the lowest environmental distance to any selected point - this is the most different point
    9. Clean up and save results to RO-Crate
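    Steps 3 and 5-8 amount to a min-max scaling followed by a max-of-min distance search, which can be sketched with plain numpy on a synthetic grid (the layer values and grid size below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
layers = rng.random((4, 200))   # 4 environmental layers over 200 grid cells

# Step 3: scale each layer to the 0.0-1.0 range
lo = layers.min(axis=1, keepdims=True)
hi = layers.max(axis=1, keepdims=True)
scaled = (layers - lo) / (hi - lo)

# Steps 5-7: environmental distance from 10 random survey cells to every cell
survey = rng.choice(200, size=10, replace=False)
diff = scaled[:, :, None] - scaled[:, None, survey]   # shape (4, 200, 10)
dist = np.sqrt((diff ** 2).sum(axis=0))               # shape (200, 10)
nearest = dist.min(axis=1)    # lowest distance to any surveyed cell

# Step 8: the cell maximising that minimum is the most different candidate
best = int(nearest.argmax())
print(best, round(float(nearest[best]), 3))
```

    In the case study the same computation runs over the data cube's raster cells instead of a synthetic grid, with the distances visualised via matplotlib.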


  19. Sample Park Analysis

    • figshare.com
    zip
    Updated Nov 2, 2025
    Cite
    Eric Delmelle (2025). Sample Park Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.30509021.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 2, 2025
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Eric Delmelle
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    README – Sample Park Analysis

    ## Overview
    This repository contains a Google Colab / Jupyter notebook and accompanying dataset used for analyzing park features and associated metrics. The notebook demonstrates data loading, cleaning, and exploratory analysis of the Hope_Park_original.csv file.

    ## Contents
    - sample park analysis.ipynb — The main analysis notebook (Colab/Jupyter format)
    - Hope_Park_original.csv — Source dataset containing park information
    - README.md — Documentation for the contents and usage

    ## Usage
    1. Open the notebook in Google Colab or Jupyter.
    2. Upload the Hope_Park_original.csv file to the working directory (or adjust the file path in the notebook).
    3. Run each cell sequentially to reproduce the analysis.

    ## Requirements
    The notebook uses standard Python data science libraries:

    ```python
    pandas
    numpy
    matplotlib
    seaborn
    ```

  20. Cognitive Fatigue

    • figshare.com
    csv
    Updated Nov 5, 2025
    Cite
    Rui Varandas; Inês Silveira; Hugo Gamboa (2025). Cognitive Fatigue [Dataset]. http://doi.org/10.6084/m9.figshare.28188143.v3
    Explore at:
    csvAvailable download formats
    Dataset updated
    Nov 5, 2025
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Rui Varandas; Inês Silveira; Hugo Gamboa
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Cognitive Fatigue

    While executing the proposed tasks, the participants' physiological signals were monitored using two biosignalsplux devices from PLUX Wireless Biosignals, Lisbon, Portugal, with a sampling frequency of 100 Hz and a resolution of 16 bits (24 bits in the case of fNIRS). Six different sensors were used: EEG and fNIRS, positioned around F7 and F8 of the 10–20 system (the dorsolateral prefrontal cortex is often used to assess CW and fatigue as well as cognitive states); ECG, monitoring an approximation of Lead I of the Einthoven system; EDA, placed on the palm of the non-dominant hand; ACC, positioned on the right side of the head to measure head movement and overall posture changes; and the RIP sensor, attached to the upper-abdominal area to measure respiration cycles. The combination of the three allows inferring the response of the Autonomic Nervous System (ANS) of the human body, namely the response of the sympathetic and parasympathetic nervous systems.

    2.1. Experimental design

    Cognitive fatigue (CF) is a phenomenon that arises following prolonged engagement in mentally demanding cognitive tasks. We therefore developed an experimental procedure involving three demanding tasks: a digital lesson in Jupyter Notebook format, three repetitions of the Corsi-Block task, and two repetitions of a concentration test.

    Before the Corsi-Block task and after the concentration task there were two-minute baseline periods. In our analysis, the first baseline period, although not explicitly present in the dataset, was designated as representing no CF, whereas the final baseline period was designated as representing the presence of CF. Between repetitions of the Corsi-Block task, there were baseline periods of 15 s after the task and of 30 s before the beginning of each repetition.

    2.2. Data recording

    A data sample of 10 volunteer participants (4 females) aged between 22 and 48 years old (M = 28.2, SD = 7.6) took part in this study. All volunteers were recruited at NOVA School of Science and Technology, were fluent in English and right-handed, and none reported suffering from psychological disorders or taking regular medication. Written informed consent was obtained before participation, and all ethical procedures approved by the Ethics Committee of NOVA University of Lisbon were thoroughly followed. In this study, we omitted the data from one participant due to the insufficient duration of data acquisition.

    2.3. Data labelling

    The labels easy, difficult, very difficult and repeat found in the ECG_lesson_answers.txt files represent the subjects' opinion of the content read in the ECG lesson. The repeat label represents the most difficult level; it is called repeat because pressing it shows the answer to the question again. This system is based on the Anki system, which has been proposed and used to memorise information effectively. In addition, the PB description JSON files include timestamps indicating the start and end of cognitive tasks, baseline periods, and other events, which are useful for defining the CF states described in 2.1.

    2.4. Data description

    Biosignals include EEG, fNIRS (not converted to oxy- and deoxy-Hb), ECG, EDA, respiration (RIP), accelerometer (ACC), and push-button (PB) data. All signals have already been converted to physical units. In each biosignal file, the first column corresponds to the timestamps.

    HCI features encompass keyboard, mouse, and screenshot data. Below is a Python code snippet for extracting screenshot files from the screenshots CSV file (with the output directory created once, before the loop):

    ```python
    import base64
    from os import makedirs
    from os.path import join

    file = '...'
    with open(file, 'r') as f:
        lines = f.readlines()

    makedirs('screenshot', exist_ok=True)
    for line in lines[1:]:
        timestamp = line.split(',')[0]
        code = line.split(',')[-1][:-2]
        imgdata = base64.b64decode(code)
        filename = str(timestamp) + '.jpeg'
        with open(join('screenshot', filename), 'wb') as f:
            f.write(imgdata)
    ```

    A characterization file containing age and gender information for all subjects in each dataset is provided within the respective dataset folder (e.g., D2_subject-info.csv). Other complementary files include (i) descriptions of the push-buttons to help segment the signals (e.g., D2_S2_PB_description.json) and (ii) labelling (e.g., D2_S2_ECG_lesson_results.txt). The files D2_Sx_results_corsi-block_board_1.json and D2_Sx_results_corsi-block_board_2.json show the results for the first and second iterations of the Corsi-Block task, where, for example, row_0_1 = 12 means that the subject got 12 pairs right in the first row of the first board, and row_0_2 = 12 means that the subject got 12 pairs right in the first row of the second board.
Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks / Understanding and Improving the Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.3519618
Organization logo

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks / Understanding and Improving the Quality and Reproducibility of Jupyter Notebooks

Explore at:
application/gzipAvailable download formats
Dataset updated
Mar 16, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Based on the results, we proposed and evaluated Julynter, a linting tool for Jupyter Notebooks.

Papers:

This repository contains three files:

Reproducing the Notebook Study

The db2020-09-22.dump.gz file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks. To load it, run:

gunzip -c db2020-09-22.dump.gz | psql jupyter

Note that this file contains only the database with the extracted data. The actual repositories are available in a Google Drive folder, which also contains the Docker images we used in the reproducibility study. The repositories are stored as content/{hash_dir1}/{hash_dir2}.tar.bz2, where hash_dir1 and hash_dir2 are columns of the repositories table in the database.
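For example, given the two hash columns from a row of the repositories table, the relative path of the corresponding archive can be assembled as follows (the hash values here are placeholders, not real column values):

```python
# Placeholder values standing in for the hash_dir1 and hash_dir2 columns
# of a row in the repositories table (real values come from the dump).
hash_dir1 = "ab"
hash_dir2 = "cdef0123"

# Relative path of the repository archive inside the Google Drive folder
archive = f"content/{hash_dir1}/{hash_dir2}.tar.bz2"
print(archive)  # content/ab/cdef0123.tar.bz2
```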

For scripts, notebooks, and detailed instructions on how to analyze or reproduce the data collection, please check the instructions in the Jupyter Archaeology repository (tag 1.0.0).

The sample.tar.gz file contains the repositories obtained during the manual sampling.

Reproducing the Julynter Experiment

The julynter_reproducibility.tar.gz file contains all the data collected in the Julynter experiment and the analysis notebooks. Reproducing the analysis is straightforward:

  • Uncompress the file: $ tar zxvf julynter_reproducibility.tar.gz
  • Install the dependencies: $ pip install -r julynter/requirements.txt
  • Run the notebooks in order: J1.Data.Collection.ipynb; J2.Recommendations.ipynb; J3.Usability.ipynb.

The collected data is stored in the julynter/data folder.

Changelog

2019/01/14 - Version 1 - Initial version
2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason
2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider them, and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.
2021/03/15 - Version 4 - Add Julynter experiment; Update database dump to include new data collected for the second paper; remove scripts and analysis notebooks from this package (moved to GitHub), add a link to Google Drive with collected repository files
