22 datasets found
  1. Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Richard Ferrers; Speedtest Global Index (2023). Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus ALC - 2020, 2022 [Dataset]. http://doi.org/10.6084/m9.figshare.13621169.v24
    Explore at:
    txt
    Available download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Speedtest Global Index
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset compares FIXED-line broadband internet speeds for four international cities plus Alice Springs:
    - Melbourne, AU
    - Bangkok, TH
    - Shanghai, CN
    - Los Angeles, US
    - Alice Springs, AU

    ERRATA:
    1. Data is for Q3 2020, but some files were labelled incorrectly as Q2-20 or June 20. They should all read Sept 20 (09-20, i.e. Q3 20) rather than Q2. Will rename and reload. Amended in v7.

    2. LAX file named 0320, when it should be Q320. Amended in v8.

    Lines of data for each geojson file (a line equates to a 600m^2 location, including total tests, devices used, and average upload and download speed):
    - MEL: 16181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
    - SHG: 31745 lines => 0.65M speedtests (2.5/100pp)
    - BKK: 29296 lines => 1.5M speedtests (14.3/100pp)
    - LAX: 15899 lines => 1.3M speedtests (10.4/100pp)
    - ALC: 76 lines => 500 speedtests (2/100pp)

    GeoJSONs of these 2° by 2° extracts for MEL, BKK and SHG are now added; LAX was added in v6 and Alice Springs in v15.

    This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.

    ** To Do
    - Will add Google Map versions so everyone can see without installing Jupyter. Link to Google Map (BKK) added below. Key: Green > 100Mbps (Superfast); Black > 500Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook.
    - Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100Mbps. Suggest plotting the Top 20% on a map for the community. Google Map link now added (and tweet).

    ** Python (GeoPandas bounding-box extracts; au_tiles / tiles are the Speedtest tile GeoDataFrames)

        melb = au_tiles.cx[144:146, -39:-37]   # Lat/Lon extract
        shg = tiles.cx[120:122, 30:32]         # Lat/Lon extract
        bkk = tiles.cx[100:102, 13:15]         # Lat/Lon extract
        lax = tiles.cx[-118:-120, 33:35]       # Lat/Lon extract
        ALC = tiles.cx[132:134, -22:-24]       # Lat/Lon extract

    Histograms (v9) and data visualisations (v3, 5, 9, 11) will be provided. Data source: this is an extract of the Speedtest Open data available at Amazon WS (link below - opendata.aws).
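    As a minimal sketch of one such histogram (not the notebook's code; the filename is hypothetical and the avg_d_kbps column name is an assumption based on the Ookla tile schema):

    ```python
    import geopandas as gpd
    import matplotlib.pyplot as plt

    mel = gpd.read_file("mel_tiles.geojson")      # hypothetical filename for the MEL extract
    mel["avg_d_mbps"] = mel["avg_d_kbps"] / 1000  # assumed column name; kbps -> Mbps

    # Distribution of average download speed across the 600m grid locations
    mel["avg_d_mbps"].plot(kind="hist", bins=50)
    plt.xlabel("Average download speed (Mbps)")
    plt.title("MEL fixed-line Speedtest locations")
    plt.show()
    ```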

    ** VERSIONS
    v24. Add tweet and Google Map of Top 20% (over 100Mbps locations) in Mel Q322. Add v1.5 MEL-Superfast notebook, and CSV of results (now on Google Map; link below).
    v23. Add graph of 2022 broadband distribution, and compare 2020 - 2022. Updated v1.4 Jupyter notebook.
    v22. Add Import ipynb; workflow-import-4cities.
    v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020: 4.3M tests; 2022: 2.9M tests)

    - Melb: 14784 lines, avg download speed 69.4 Mbps, 0.39M tests
    - SHG: 31207 lines, avg 233.7 Mbps, 0.56M tests
    - ALC: 113 lines, avg 51.5 Mbps, 1092 tests
    - BKK: 29684 lines, avg 215.9 Mbps, 1.2M tests
    - LAX: 15505 lines, avg 218.5 Mbps, 0.74M tests

    v20. Speedtest - Five Cities inc ALC.
    v19. Add ALC2.ipynb.
    v18. Add ALC line graph.
    v17. Added ipynb for ALC. Added ALC to title.
    v16. Load Alice Springs Data Q221 - csv. Added Google Map link of ALC.
    v15. Load Melb Q1 2021 data - csv.
    v14. Added Melb Q1 2021 data - geojson.
    v13. Added Twitter link to pics.
    v12. Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb).
    v11. Add Line-Compare pic, plotting Four Cities on a graph.
    v10. Add Four Histograms in one pic.
    v9. Add Histogram for Four Cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook).
    v8. Renamed LAX file to Q3, rather than 03.
    v7. Amended file names of BKK files to correctly label as Q3, not Q2 or 06.
    v6. Added LAX file.
    v5. Add screenshot of BKK Google Map.
    v4. Add BKK Google map (link below), and BKK csv mapping files.
    v3. Replaced MEL map with big key version. Prev key was very tiny in top right corner.
    v2. Uploaded MEL, SHG, BKK data and Jupyter Notebook.
    v1. Metadata record.

    ** LICENCE
    The AWS data licence on Speedtest data is "CC BY-NC-SA 4.0", so use of this data must be:
    - non-commercial (NC)
    - share-alike (SA): reuse must add the same licence.
    This restricts the standard CC-BY Figshare licence.

    ** Other uses of Speedtest Open Data: see the link at Speedtest below.

  2. Amazon Web Scrapping Dataset

    • kaggle.com
    zip
    Updated Jun 17, 2023
    Cite
    Mohammad Hurairah (2023). Amazon Web Scrapping Dataset [Dataset]. https://www.kaggle.com/datasets/mohammadhurairah/amazon-web-scrapper-dataset
    Explore at:
    zip (2220 bytes)
    Available download formats
    Dataset updated
    Jun 17, 2023
    Authors
    Mohammad Hurairah
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Amazon Scrapping Dataset:
    1. Import libraries
    2. Connect to the website
    3. Import CSV and datetime
    4. Import pandas
    5. Appending dataset to CSV
    6. Automation - Dataset updated
    7. Timers setup
    8. Email notification
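    The list above is only an outline; a rough, hypothetical sketch of such a scraper (the URL, element selectors and output filename are placeholders, not taken from this dataset) could look like:

    ```python
    import csv
    import datetime

    import requests
    from bs4 import BeautifulSoup

    URL = "https://www.amazon.com/dp/EXAMPLE"    # placeholder product page
    HEADERS = {"User-Agent": "Mozilla/5.0"}      # a browser-like User-Agent is usually required

    def check_price():
        # 2. Connect to the website and parse the page
        page = requests.get(URL, headers=HEADERS, timeout=10)
        soup = BeautifulSoup(page.content, "html.parser")
        title = soup.find(id="productTitle")              # hypothetical element id
        price = soup.find("span", class_="a-offscreen")   # hypothetical price element
        today = datetime.date.today()

        # 5. Append the scraped row to a CSV file
        with open("amazon_scrape.csv", "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([
                title.get_text(strip=True) if title else "",
                price.get_text(strip=True) if price else "",
                today,
            ])

    if __name__ == "__main__":
        check_price()  # steps 6-7: a scheduler or a time.sleep loop could call this on a timer
    ```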

  3. Sample Park Analysis

    • figshare.com
    zip
    Updated Nov 2, 2025
    Cite
    Eric Delmelle (2025). Sample Park Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.30509021.v1
    Explore at:
    zip
    Available download formats
    Dataset updated
    Nov 2, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eric Delmelle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    README – Sample Park Analysis

    ## Overview
    This repository contains a Google Colab / Jupyter notebook and accompanying dataset used for analyzing park features and associated metrics. The notebook demonstrates data loading, cleaning, and exploratory analysis of the Hope_Park_original.csv file.

    ## Contents
    - sample park analysis.ipynb — The main analysis notebook (Colab/Jupyter format)
    - Hope_Park_original.csv — Source dataset containing park information
    - README.md — Documentation for the contents and usage

    ## Usage
    1. Open the notebook in Google Colab or Jupyter.
    2. Upload the Hope_Park_original.csv file to the working directory (or adjust the file path in the notebook).
    3. Run each cell sequentially to reproduce the analysis.

    ## Requirements
    The notebook uses standard Python data science libraries:
    ```python
    pandas
    numpy
    matplotlib
    seaborn
    ```
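    As a minimal sketch of the usage steps above (not the notebook's actual code; the dataset's columns are unknown here, so only generic exploration is shown):

    ```python
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Step 2: the CSV is assumed to sit in the working directory
    df = pd.read_csv("Hope_Park_original.csv")

    print(df.shape)
    df.info()
    print(df.describe(include="all").T)

    # Quick look at pairwise relationships among numeric columns
    sns.heatmap(df.corr(numeric_only=True), annot=False)
    plt.tight_layout()
    plt.show()
    ```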

  4. OpenOrca

    • kaggle.com
    • opendatalab.com
    • +1more
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). OpenOrca [Dataset]. https://www.kaggle.com/datasets/thedevastator/open-orca-augmented-flan-dataset/versions/2
    Explore at:
    zip (2548102631 bytes)
    Available download formats
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Open-Orca Augmented FLAN Dataset

    Unlocking Advanced Language Understanding and ML Model Performance

    By Huggingface Hub [source]

    About this dataset

    The Open-Orca Augmented FLAN Collection is a revolutionary dataset that unlocks new levels of language understanding and machine learning model performance. This dataset was created to support research on natural language processing, machine learning models, and language understanding through leveraging the power of reasoning trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences in a more robust way than ever before, this dataset provides researchers expanded opportunities for furthering the progress of linguistics research. With its unique combination of features including system prompts, questions from users and responses from systems, this dataset opens up exciting possibilities for deeper exploration into the cutting edge concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance - explore Open-Orca Augmented FLAN Collection today!


    How to use the dataset

    This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.

    Getting Started: The first step is to download the data set from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or cloud storage. Once you have downloaded the data set, launch the Jupyter Notebook or Google Colab environment you want to work with.

    Exploring & Preprocessing Data: To get a better understanding of the features in this dataset, import them into a Pandas DataFrame as shown below. You can use other libraries as needed:

    import pandas as pd   # Library used for importing datasets into Python 
    
    df = pd.read_csv('train.csv')  # Imports the train csv file into a Pandas DataFrame
    
    df[['system_prompt','question','response']].head() #Views top 5 rows with columns 'system_prompt','question','response'
    

    After importing, check each feature using basic descriptive statistics such as Pandas value_counts or groupby statements: these give greater clarity over the values present in each feature. The command below shows the count of each element in the 'system_prompt' column of the train CSV file:

     df['system_prompt'].value_counts().head()  # shows count of each element present under the 'system_prompt' column
     # Output: 'User says hello guys': 587; 'System asks How are you?': 555; 'User says I am doing good': 487; ...and so on
    

    Data Transformation: After inspecting and exploring the different features, you may want or need certain changes that best suit your needs before training modelling algorithms on this dataset.
    Common transformation steps include removing punctuation marks: since punctuation may not add any value to computation, it can be removed with a regex-based replace such as .replace('[^A-Za-z ]+', ''), as sketched below.
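    A minimal sketch of that cleaning step, assuming the train.csv columns shown above ('system_prompt', 'question', 'response'); the regex keeps only letters and spaces:

    ```python
    import pandas as pd

    df = pd.read_csv('train.csv')  # assumed columns: system_prompt, question, response

    # Strip punctuation and digits from the response text, keeping letters and spaces only
    df['response_clean'] = df['response'].str.replace(r'[^A-Za-z ]+', '', regex=True)

    print(df[['response', 'response_clean']].head())
    ```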

    Research Ideas

    • Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
    • Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
    • Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...

  5. Galaxy Training Material for the 'Use Jupyter notebooks in Galaxy' tutorial

    • zenodo.org
    csv
    Updated Apr 22, 2025
    Cite
    Delphine Lariviere; Delphine Lariviere; Teresa Müller; Teresa Müller (2025). Galaxy Training Material for the 'Use Jupyter notebooks in Galaxy' tutorial [Dataset]. http://doi.org/10.5281/zenodo.15263830
    Explore at:
    csv
    Available download formats
    Dataset updated
    Apr 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Delphine Lariviere; Delphine Lariviere; Teresa Müller; Teresa Müller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was originally curated by Software Carpentry, a branch of The Carpentries non-profit organization, and is based on data from the Gapminder Foundation. It consists of six tabular CSV files containing GDP data for various countries across different years. The dataset was initially prepared for the Software Carpentry tutorial "Plotting and Programming in Python" and is also reused in the Galaxy Training Network (GTN) tutorial "Use Jupyter Notebooks in Galaxy."

    This GTN tutorial provides an introduction to launching a Jupyter Notebook in Galaxy, installing dependencies, and importing and exporting data. It serves as a setup guide for a Jupyter Notebook environment that can be used to follow the Software Carpentry tutorial "Plotting and Programming in Python."

  6. synthetic_credit_card_default

    • huggingface.co
    Updated Aug 14, 2025
    Cite
    Syncora.ai - Agentic Synthetic Data Platform (2025). synthetic_credit_card_default [Dataset]. https://huggingface.co/datasets/syncora/synthetic_credit_card_default
    Explore at:
    Dataset updated
    Aug 14, 2025
    Authors
    Syncora.ai - Agentic Synthetic Data Platform
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Synthetic Credit Card Default Dataset

      High-fidelity synthetic dataset for financial AI research, created with Syncora.ai
    
    
    
    
    
    
      ✅ What's in This Repo?
    

    This repository includes:

    ✅ Synthetic Credit Card Default Dataset (CSV) → Download Here
    ✅ Jupyter Notebook for Analysis & Modeling → Open Notebook
    ✅ Instructions for generating your own synthetic data using the Syncora API

      📘 About This Dataset
    

    This dataset contains realistic, fully synthetic credit card… See the full description on the dataset page: https://huggingface.co/datasets/syncora/synthetic_credit_card_default.

  7. Cognitive Fatigue

    • figshare.com
    csv
    Updated Nov 5, 2025
    Cite
    Rui Varandas; Inês Silveira; Hugo Gamboa (2025). Cognitive Fatigue [Dataset]. http://doi.org/10.6084/m9.figshare.28188143.v3
    Explore at:
    csv
    Available download formats
    Dataset updated
    Nov 5, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rui Varandas; Inês Silveira; Hugo Gamboa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Cognitive Fatigue
    While executing the proposed tasks, the participants' physiological signals were monitored using two biosignalsplux devices from PLUX Wireless Biosignals, Lisbon, Portugal, with a sampling frequency of 100 Hz and a resolution of 16 bits (24 bits in the case of fNIRS). Six different sensors were used: EEG and fNIRS positioned around F7 and F8 of the 10–20 system (the dorsolateral prefrontal cortex is often used to assess CW and fatigue as well as cognitive states); ECG monitoring an approximation of Lead I of the Einthoven system; EDA placed on the palm of the non-dominant hand; ACC positioned on the right side of the head to measure head movement and overall posture changes; and the RIP sensor attached to the upper-abdominal area to measure respiration cycles. The combination of the three allows inference about the response of the Autonomic Nervous System (ANS) of the human body, namely the response of the sympathetic and parasympathetic nervous systems.

    2.1. Experimental design
    Cognitive fatigue (CF) is a phenomenon that arises following prolonged engagement in mentally demanding cognitive tasks. Thus, we developed an experimental procedure that involved three demanding tasks: a digital lesson in Jupyter Notebook format, three repetitions of the Corsi-Block task, and two repetitions of a concentration test. Before the Corsi-Block task and after the concentration task there were baseline periods of two minutes. In our analysis, the first baseline period, although not explicitly present in the dataset, was designated as representing no CF, whereas the final baseline period was designated as representing the presence of CF. Between repetitions of the Corsi-Block task, there were baseline periods of 15 s after the task and of 30 s before the beginning of each repetition of the task.

    2.2. Data recording
    A data sample of 10 volunteer participants (4 females) aged between 22 and 48 years old (M = 28.2, SD = 7.6) took part in this study. All volunteers were recruited at NOVA School of Science and Technology, were fluent in English and right-handed, none reported suffering from psychological disorders, and none reported taking regular medication. Written informed consent was obtained before participating, and all Ethical Procedures approved by the Ethics Committee of NOVA University of Lisbon were thoroughly followed. In this study, we omitted the data from one participant due to the insufficient duration of data acquisition.

    2.3. Data labelling
    The labels easy, difficult, very difficult and repeat found in the ECG_lesson_answers.txt files represent the subjects' opinion of the content read in the ECG lesson. The repeat label represents the most difficult level. It is called repeat because, when pressed, the answer to the question is shown again. This system is based on the Anki system, which has been proposed and used to memorise information effectively. In addition, the PB description JSON files include timestamps indicating the start and end of cognitive tasks, baseline periods, and other events, which are useful for defining CF states as defined in 2.1.

    2.4. Data description
    Biosignals include EEG, fNIRS (not converted to oxy- and deoxy-Hb), ECG, EDA, respiration (RIP), accelerometer (ACC), and push-button data (PB). All signals have already been converted to physical units. In each biosignal file, the first column corresponds to the timestamps. HCI features encompass keyboard, mouse, and screenshot data.
    Below is a Python code snippet for extracting screenshot files from the screenshots CSV file.

        import base64
        from os import mkdir
        from os.path import join

        file = '...'  # path to the screenshots CSV file
        with open(file, 'r') as f:
            lines = f.readlines()

        mkdir('screenshot')                   # create the output folder once, before the loop
        for line in lines[1:]:
            timestamp = line.split(',')[0]
            code = line.split(',')[-1][:-2]   # base64 payload, minus trailing characters
            imgdata = base64.b64decode(code)
            filename = str(timestamp) + '.jpeg'
            with open(join('screenshot', filename), 'wb') as img:
                img.write(imgdata)

    A characterization file containing age and gender information for all subjects in each dataset is provided within the respective dataset folder (e.g., D2_subject-info.csv). Other complementary files include (i) descriptions of the push-buttons to help segment the signals (e.g., D2_S2_PB_description.json) and (ii) labelling (e.g., D2_S2_ECG_lesson_results.txt). The files D2_Sx_results_corsi-block_board_1.json and D2_Sx_results_corsi-block_board_2.json show the results for the first and second iterations of the Corsi-Block task, where, for example, row_0_1 = 12 means that the subject got 12 pairs right in the first row of the first board, and row_0_2 = 12 means that the subject got 12 pairs right in the first row of the second board.
  8. Open University Learning Analytics Dataset

    • kaggle.com
    zip
    Updated Dec 21, 2023
    Cite
    The Devastator (2023). Open University Learning Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/open-university-learning-analytics-dataset
    Explore at:
    zip (44203263 bytes)
    Available download formats
    Dataset updated
    Dec 21, 2023
    Authors
    The Devastator
    Description

    Open University Learning Analytics Dataset

    Student Performance and Engagement Data at The Open University

    By UCI [source]

    About this dataset

    This dataset provides an intimate look into student performance and engagement. It grants researchers access to numerous salient metrics of academic performance which illuminate a broad spectrum of student behaviors: how students interact with online learning material; quantitative indicators reflecting their academic outcomes; as well as demographic data such as age group, gender, prior education level among others.

    The main objective of this dataset is to enable analysts and educators alike with empirical insights underpinning individualized learning experiences - specifically in identifying cases when students may be 'at risk'. Given that preventive early interventions have been shown to significantly mitigate chances of course or program withdrawal among struggling students - having accurate predictive measures such as this can greatly steer pedagogical strategies towards being more success oriented.

    One unique feature of this dataset is its intricate detailing. Not only does it provide overarching summaries on a per-student basis for each presented course, but it also furnishes data related to assessments (scores and submission dates) along with information on individuals' interactions within VLEs (virtual learning environments), spanning different types like forums, content pages, etc. Such comprehensive collation across multiple contextual layers helps paint an encompassing portrayal of the student experience that can guide better instructional design.

    Due credit must be given when utilizing this database for research purposes, through citation. Specifically, referencing Kuzilek et al. (2015), "OU Analyse: Analysing At-Risk Students at The Open University", published in Learning Analytics Review, is required, since the analysis methodologies stem from that seminal work.

    Immaterial aspects aside - it is important to note that protection of student privacy is paramount within this dataset's terms and conditions. Stringent anonymization techniques have been implemented across sensitive variables - while detailed, profiles can't be traced back to original respondents.

    How to use the dataset

    How To Use This Dataset:

    • Understanding Your Objectives: Ideal objectives for using this dataset could be to identify at-risk students before they drop out of a class or program, improving course design by analyzing how assignments contribute to final grades, or simply examining relationships between different variables and student performance.

    • Set up your Analytical Environment: Before starting any analysis make sure you have an analytical environment set up where you can load the CSV files included in this dataset. You can use Python notebooks (Jupyter), R Studio or Tableau based software in case you want visual representation as well.

    • Explore Data Individually: There are seven separate datasets available: Assessments; Courses; Student Assessment; Student Info; Vle (Virtual Learning Environment); Student Registration and Student Vle. Load these CSVs separately into your environment and do an initial exploration of each one: find out what kind of data they contain (numerical/categorical), if they have missing values etc.

    • Merge Datasets: As the core idea is to track a student’s journey through multiple courses over time, combining these datasets will provide insights from wider perspectives. One way could be merging them using common key columns such as 'code_module', 'code_presentation', & 'id_student' (see the sketch after this list). But make sure the merge depends on what question you're trying to answer.

    • Identify Key Metrics: Your key metrics will depend on your objectives but might include: overall grade averages per course or assessment type/student/region/gender/age group etc., number of clicks in the virtual learning environment, student registration status etc.

    • Run Your Analysis: Now you can run queries to analyze the data relevant to your objectives. Try questions like: What factors most strongly predict whether a student will fail an assessment? or How does course difficulty or the number of allotments per week change students' scores?

    • Visualization: Visualizing your data can be crucial for understanding patterns and relationships between variables. Use graphs like bar plots, heatmaps, and histograms to represent different aspects of your analyses.

    • Actionable Insights: The final step is interpreting these results in ways that are meaningf...
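    A minimal sketch of the merge described in step 4 (file and column names follow the standard OULAD CSV release and should be checked against this download):

    ```python
    import pandas as pd

    student_info = pd.read_csv("studentInfo.csv")
    assessments = pd.read_csv("assessments.csv")
    student_assessment = pd.read_csv("studentAssessment.csv")

    # Attach module/presentation info to each assessment result, then join student demographics
    scores = student_assessment.merge(assessments, on="id_assessment", how="left")
    merged = scores.merge(
        student_info, on=["code_module", "code_presentation", "id_student"], how="left"
    )

    # Example metric (step 5): average assessment score per module and final result
    print(merged.groupby(["code_module", "final_result"])["score"].mean())
    ```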

  9. Articles metadata from CrossRef

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Kea Kohv (2025). Articles metadata from CrossRef [Dataset]. https://www.kaggle.com/datasets/keakohv/articles-doi-metadata
    Explore at:
    zip (72982417 bytes)
    Available download formats
    Dataset updated
    Aug 1, 2025
    Authors
    Kea Kohv
    Description

    This data originates from the Crossref API. It contains metadata for the articles in the Data Citation Corpus where the dataset in the citation pair is a DOI.

    How to recreate this dataset in Jupyter Notebook:

    1) Prepare the list of articles to query:

    ```python
    import pandas as pd

    # See: https://www.kaggle.com/datasets/keakohv/data-citation-coprus-v4-1-eupmc-and-datacite
    CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"

    # Load the citation pairs from the Parquet file
    citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

    # Remove all rows where "https" is in the 'dataset' column but "doi.org" is not present
    citation_pairs = citation_pairs[
        ~((citation_pairs['dataset'].str.contains("https")) & (~citation_pairs['dataset'].str.contains("doi.org")))
    ]

    # Remove all rows where figshare is in the dataset name
    citation_pairs = citation_pairs[~citation_pairs['dataset'].str.contains("figshare")]

    citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
    citation_pairs_doi = citation_pairs[citation_pairs['is_doi'] == True].copy()

    articles = list(set(citation_pairs_doi['publication'].to_list()))
    articles = [doi.replace("_", "/") for doi in articles]

    # Save the list of articles to a file
    with open("articles.txt", "w") as f:
        for article in articles:
            f.write(f"{article}\n")
    ```

    2) Query articles from CrossRef API

    
    %%writefile enrich.py
    #!pip install -q aiolimiter
    import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
    from aiolimiter import AsyncLimiter
    
    # ---------- config ----------
    HEADERS  = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"} # Put your email here
    MAX_RPS  = 45           # polite pool limit (50), leave head-room
    BATCH_SIZE = 10_000         # rows per INSERT
    DB_PATH  = pathlib.Path("crossref.sqlite").resolve()
    ARTICLES  = pathlib.Path("articles.txt")
    # -----------------------------
    
    # ---- platform tweak: prefer selector loop on Windows ----
    if sys.platform == "win32":
      asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    
    # ---- read the DOI list ----
    with ARTICLES.open(encoding="utf-8") as f:
      DOIS = [line.strip() for line in f if line.strip()]
    
    # ---- make sure DB & table exist BEFORE the async part ----
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(DB_PATH) as db:
      db.execute("""
        CREATE TABLE IF NOT EXISTS works (
          doi  TEXT PRIMARY KEY,
          json TEXT
        )
      """)
      db.execute("PRAGMA journal_mode=WAL;")   # better concurrency
    
    # ---------- async section ----------
    limiter = AsyncLimiter(MAX_RPS, 1)       # 45 req / second
    sem   = asyncio.Semaphore(100)        # cap overall concurrency
    
    async def fetch_one(session, doi: str):
      url = f"https://api.crossref.org/works/{doi}"
      async with limiter, sem:
        try:
          async with session.get(url, headers=HEADERS, timeout=10) as r:
            if r.status == 404:         # common “not found”
              return doi, None
            r.raise_for_status()        # propagate other 4xx/5xx
            return doi, await r.json()
        except Exception as e:
          return doi, None            # log later, don’t crash
    
    async def main():
      start = time.perf_counter()
      db  = sqlite3.connect(DB_PATH)        # KEEP ONE connection
      db.execute("PRAGMA synchronous = NORMAL;")   # speed tweak
    
      async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
        for chunk_start in range(0, len(DOIS), BATCH_SIZE):
          slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
          tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
          results = await asyncio.gather(*tasks)    # all tuples, no exc
    
          good_rows, bad_dois = [], []
          for doi, payload in results:
            if payload is None:
              bad_dois.append(doi)
            else:
              good_rows.append((doi, orjson.dumps(payload).decode()))
    
          if good_rows:
            db.executemany(
              "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
              good_rows,
            )
            db.commit()
    
          if bad_dois:                # append for later retry
            with open("failures.log", "a", encoding="utf-8") as fh:
              fh.writelines(f"{d}\n" for d in bad_dois)
    
          done = chunk_start + len(slice_)
          rate = done / (time.perf_counter() - start)
          print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")
    
      db.close()
    
    if __name__ == "__main__":
      asyncio.run(main())
    

    Then run the script from a Jupyter cell: !python enrich.py

    3) Finally extract the necessary fields

    import sqlite3
    import orjson
    i...
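    The snippet above is truncated. As a rough, hypothetical sketch (not the author's code), the works table created in step 2 (columns doi, json) could be read back and fields pulled out of the stored Crossref response, whose record sits under the "message" key:

    ```python
    import sqlite3

    import orjson
    import pandas as pd

    rows = []
    with sqlite3.connect("crossref.sqlite") as db:
        for doi, raw in db.execute("SELECT doi, json FROM works"):
            msg = orjson.loads(raw).get("message", {})  # Crossref wraps the record in "message"
            rows.append({
                "doi": doi,
                "title": (msg.get("title") or [None])[0],              # assumed Crossref field
                "journal": (msg.get("container-title") or [None])[0],  # assumed Crossref field
                "type": msg.get("type"),
            })

    pd.DataFrame(rows).to_csv("crossref_metadata.csv", index=False)
    ```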
    
  10. Cleaned Contoso Dataset

    • kaggle.com
    zip
    Updated Aug 27, 2023
    Cite
    Bhanu (2023). Cleaned Contoso Dataset [Dataset]. https://www.kaggle.com/datasets/bhanuthakurr/cleaned-contoso-dataset
    Explore at:
    zip (487695063 bytes)
    Available download formats
    Dataset updated
    Aug 27, 2023
    Authors
    Bhanu
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data was imported from the BAK file found here into SQL Server, and then individual tables were exported as CSV. The Jupyter Notebook containing the code used to clean the data can be found here.

    Version 6 has some more cleaning and structuring that was noticed after importing into Power BI. Changes were made by adding code in the Python notebook to export a new cleaned dataset, such as adding MonthNumber for sorting by month number, and similarly WeekDayNumber.

    Cleaning was done in Python, while also using SQL Server to quickly find things. Headers were added separately, ensuring no data loss. Data was cleaned for NaN values, garbage values and other column issues.

  11. The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures...

    • data.niaid.nih.gov
    Updated Sep 25, 2020
    Cite
    Chen, Xinlei; Liu, Xinyu; Eng, Kent X.; Liu, Jingxiao; Noh, Hae Young; Zhang, Lin; Zhang, Pei (2020). The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures in Multiple Cities Sensed by Static & Mobile Devices [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4028129
    Explore at:
    Dataset updated
    Sep 25, 2020
    Dataset provided by
    Carnegie Mellon University
    Stanford University
    Tsinghua University
    Authors
    Chen, Xinlei; Liu, Xinyu; Eng, Kent X.; Liu, Jingxiao; Noh, Hae Young; Zhang, Lin; Zhang, Pei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This S&M-HSTPM2d5 dataset contains high spatial and temporal resolution particulate (PM2.5) measurements, with the corresponding timestamp and GPS location of mobile and static devices, in three Chinese cities: Foshan, Cangzhou, and Tianjin. Different numbers of static and mobile devices were set up in each city. The sampling rate was set to one minute in Cangzhou, and three seconds in Foshan and Tianjin. For specific details of the setup, please refer to the Device_Setup_Description.txt file in this repository and the data descriptor paper.

    After the data collection process, the data cleaning process was performed to remove and adjust the abnormal and drifting data. The script of the data cleaning algorithm is provided in this repository. The data cleaning algorithm only adjusts or removes individual data points. The removal of the entire device's data was done after the data cleaning algorithm with empirical judgment and graphic visualization. For specific detail of the data cleaning process, please refer to the script (Data_cleaning_algorithm.ipynb) in this repository and the data descriptor paper.

    The dataset in this repository is the processed version. The raw dataset and removed devices are not included in this repository.

    The data is stored as a CSV file. Each CSV file which is named by the device ID represents the data that was collected by the corresponding device. Each CSV file has three types of data: timestamp as the China Standard Time (GMT+8), geographic location as latitude and longitude, and PM2.5 concentration with the unit of microgram per cubic meter. The CSV files are stored in either Static or Mobile folder which represents the devices' type. The Static and Mobile folder are stored in the corresponding city's folder.

    To access the dataset, any programming language that can read CSV files is appropriate. Users can also open the CSV files directly. The get_dataset.ipynb file in this repository also provides an option for accessing the dataset. To successfully execute the ipynb files, Jupyter Notebook with Python 3 is required. The following Python libraries are also required:

    get_dataset.ipynb: 1. os library 2. pandas library

    Data_cleaning_algorithm.ipynb: 1. os library 2. pandas library 3. datetime library 4. math library

    Instructions for installing the libraries above can be found online. After installing Jupyter Notebook with Python 3 and the required libraries, users can open the ipynb files with Jupyter Notebook and follow the instructions inside each file.
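    As a minimal sketch (not from this repository; the folder layout and column meaning follow the description above):

    ```python
    import os

    import pandas as pd

    # Assumed layout: <city>/<Static|Mobile>/<device_id>.csv
    city, device_type = "Foshan", "Static"
    folder = os.path.join(city, device_type)

    frames = []
    for name in os.listdir(folder):
        if name.endswith(".csv"):
            df = pd.read_csv(os.path.join(folder, name))
            df["device_id"] = name.replace(".csv", "")
            frames.append(df)

    data = pd.concat(frames, ignore_index=True)
    print(data.head())  # timestamp (GMT+8), latitude/longitude, PM2.5 (µg/m³)
    ```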

    For questions or suggestions please e-mail Xinlei Chen

  12. Student Performance and Learning Behavior Dataset for Educational Analytics

    • zenodo.org
    bin, csv
    Updated Aug 13, 2025
    + more versions
    Cite
    Kamal NAJEM; Kamal NAJEM (2025). Student Performance and Learning Behavior Dataset for Educational Analytics [Dataset]. http://doi.org/10.5281/zenodo.16459132
    Explore at:
    bin, csv
    Available download formats
    Dataset updated
    Aug 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kamal NAJEM; Kamal NAJEM
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 26, 2025
    Description

    The dataset used in this study integrates quantitative data on student learning behaviors, engagement patterns, demographics, and academic performance. It was compiled by merging two publicly available Kaggle datasets, resulting in a combined file (“merged_dataset.csv”) containing 14,003 student records with 16 attributes. All records are anonymized and contain no personally identifiable information.

    The dataset covers the following categories of variables:

    • Study behaviors and engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions
    • Resource access and learning environment: Resources, Internet, EduTech

    • Motivation and psychological factors: Motivation, StressLevel

    • Demographic information: Gender, Age (ranging from 18 to 30 years)

    • Learning preference classification: LearningStyle

    • Academic performance indicators: ExamScore, FinalGrade

    In this study, “ExamScore” and “FinalGrade” served as the primary performance indicators. The remaining variables were used to derive behavioral and contextual profiles, which were clustered using unsupervised machine learning techniques.

    The analysis and modeling were implemented in Python through a structured Jupyter Notebook (“Project.ipynb”), which included the following main steps:

    1. Environment Setup – Import of essential libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, StatsModels, scikit-learn, imbalanced-learn) and visualization configuration.

    2. Data Import and Integration – Loading the two source CSV files, harmonizing columns, removing irrelevant attributes, aligning formats, handling missing values, and merging them into a unified dataset (merged_dataset.csv).

    3. Data Preprocessing

      • Encoding categorical variables using LabelEncoder.

      • Scaling features using both z-score standardization (for statistical tests and PCA) and Min–Max normalization (for clustering).

      • Detecting and removing duplicates.

    4. Clustering Analysis

      • Applying K-Means clustering to segment learners into distinct profiles.

      • Determining the optimal number of clusters using the Elbow Method and Silhouette Score.

      • Evaluating cluster quality with internal metrics (Silhouette Score, Davies–Bouldin Index).

    5. Dimensionality Reduction & Visualization – Using PCA for 2D/3D cluster visualization and feature importance exploration.

    6. Mapping Clusters to Learning Styles – Associating each identified cluster with the most relevant learning style model based on feature patterns and alignment scores.

    7. Statistical Analysis – Conducting ANOVA and regression to test for significant differences in performance between clusters.

    8. Interpretation & Practical Recommendations – Analyzing cluster-specific characteristics and providing implications for adaptive and mobile learning integration.
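    A minimal sketch of the preprocessing and clustering steps above (LabelEncoder, Min-Max scaling, K-Means selected by silhouette score), assuming the merged_dataset.csv columns listed earlier; this is illustrative, not the Project.ipynb code:

    ```python
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score, silhouette_score
    from sklearn.preprocessing import LabelEncoder, MinMaxScaler

    df = pd.read_csv("merged_dataset.csv").drop_duplicates()

    # Step 3: encode categorical variables and scale features
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    features = df.drop(columns=["ExamScore", "FinalGrade"])  # cluster on behavioural/contextual variables
    X = MinMaxScaler().fit_transform(features)

    # Step 4: choose k via silhouette score (sampled for speed), then evaluate the final model
    scores = {
        k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X),
                            sample_size=5000, random_state=42)
        for k in range(2, 7)
    }
    best_k = max(scores, key=scores.get)
    labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)
    print(best_k, silhouette_score(X, labels, sample_size=5000, random_state=42),
          davies_bouldin_score(X, labels))
    ```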

  13. Kalimati Tarkari Dataset Updated

    • kaggle.com
    zip
    Updated Jun 30, 2025
    Cite
    Bibek Maharjan (2025). Kalimati Tarkari Dataset Updated [Dataset]. https://www.kaggle.com/datasets/abnormalbbk/kalimati-tarkari-dataset-updated
    Explore at:
    zip (1987302 bytes)
    Available download formats
    Dataset updated
    Jun 30, 2025
    Authors
    Bibek Maharjan
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Kalimati
    Description

    Dive into daily fresh insights from Nepal’s biggest vegetable market — Kalimati Tarkari! This comprehensive dataset 📊 covers vegetable market data from 2013 to 2025(June), perfect for researchers, traders and agri-enthusiasts.

    🚀 Data Sources Part of the dataset is downloaded from the official open data portal: 👉 https://opendatanepal.com/dataset/kalimati-tarkari-dataset 📥

    The rest of the data is extracted daily via the open-source project maintained here: 👉 https://github.com/ErKiran/kalimati 🔗

    🔍 How the Dataset Was Created Both parts are downloaded using a custom Python script 🐍 and a Jupyter Notebook 📓 that automate downloading, merging, and cleaning multiple daily CSV files into one single, clean dataset — ready for analysis. 🧹✨
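    A minimal sketch of that merge step, assuming the daily files are already downloaded as local CSVs with a shared schema (folder and file names are hypothetical):

    ```python
    import glob

    import pandas as pd

    daily_files = sorted(glob.glob("daily_csv/*.csv"))  # hypothetical folder of daily downloads
    merged = pd.concat((pd.read_csv(f) for f in daily_files), ignore_index=True)

    merged = merged.drop_duplicates()
    merged.to_csv("kalimati_tarkari_merged.csv", index=False)
    ```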

    🌟 Why Use This Dataset? Detailed daily records of vegetable prices and quantities 💰📦

    Valuable for market trend analysis, forecasting, and agricultural planning 🚜📈

    Ideal for data science and machine learning projects 🤖

    🙏 Credits Big thanks to the Open Data Nepal portal and the amazing team behind the extraction project at Github ErKiran for providing the raw data and scripts! 🙌

    📝 Notes A few days of data are missing due to unavailable or incomplete daily reports.

  14. Renewable power plants

    • data.open-power-system-data.org
    csv, sqlite, xlsx
    Updated Mar 8, 2018
    + more versions
    Cite
    Ingmar Schlecht (2018). Renewable power plants [Dataset]. http://doi.org/10.25832/renewable_power_plants/2018-03-08
    Explore at:
    csv, sqlite, xlsx
    Available download formats
    Dataset updated
    Mar 8, 2018
    Dataset provided by
    Open Power System Data
    Authors
    Ingmar Schlecht
    Variables measured
    Solar, Onshore, Offshore, Geothermal, Run-of-river, Bioenergy and renewable waste
    Description

    List of renewable energy power stations. This Data Package contains lists of renewable energy-based power plants of Germany, Denmark, France and Poland:
    - Germany: More than 1.7 million renewable power plant entries, eligible under the renewable support scheme (EEG).
    - Denmark: Wind and photovoltaic power plants with a high level of detail.
    - France: Aggregated capacity and number of installations per energy source per municipality (Commune).
    - Poland: Summed capacity and number of installations per energy source per municipality (Powiat).
    - Switzerland: Renewable power plants eligible under the Swiss feed-in tariff KEV (Kostendeckende Einspeisevergütung).
    Due to different data availability, the power plant lists are of different accuracy and partly provide different power plant parameters. Because of that, the lists are provided as separate csv files per country and as separate sheets in the Excel file. Suspect data or entries with a high probability of duplication are marked in the column 'comment'. These validation markers are explained in the file validation_marker.csv. Filtering out all entries with comments results in the recommended data set. Additionally, the Data Package includes a daily time series of cumulative installed capacity per energy source type for Germany. All data processing is conducted in Python and pandas and has been documented in the Jupyter Notebooks linked below.

  15. Renewable power plants

    • data.open-power-system-data.org
    • kaggle.com
    csv, sqlite, xlsx
    Updated Aug 25, 2020
    Cite
    Ingmar Schlecht; Milos Simic (2020). Renewable power plants [Dataset]. http://doi.org/10.25832/renewable_power_plants/2020-08-25
    Explore at:
    csv, sqlite, xlsx
    Available download formats
    Dataset updated
    Aug 25, 2020
    Dataset provided by
    Open Power System Data
    Authors
    Ingmar Schlecht; Milos Simic
    Time period covered
    Jan 2, 1900 - Dec 31, 2019
    Variables measured
    day, DE_wind_capacity, DK_wind_capacity, SE_wind_capacity, CH_solar_capacity, DE_solar_capacity, DK_solar_capacity, FR_hydro_capacity, FR_solar_capacity, FR_marine_capacity, and 30 more
    Description

    List of renewable energy power stations. This Data Package contains lists of renewable energy-based power plants of Czechia, Denmark, France, Germany, Poland, Sweden, Switzerland and the United Kingdom:
    - Czechia: Renewable-energy power plants in the Czech Republic.
    - Denmark: Wind and photovoltaic power plants with a high level of detail.
    - France: Renewable-energy power plants of various types (solar, hydro, wind, bioenergy, marine, geothermal) in France.
    - Germany: Individual power plants; all renewable energy plants supported by the German Renewable Energy Law (EEG).
    - Poland: Summed capacity and number of installations per energy source per municipality (Powiat).
    - Sweden: Wind power plants in Sweden.
    - Switzerland: All renewable-energy power plants supported by the feed-in tariff KEV (Kostendeckende Einspeisevergütung).
    - United Kingdom: Renewable-energy power plants in the United Kingdom.
    Due to different data availability, the power plant lists are of different accuracy and partly provide different power plant parameters. Because of that, the lists are provided as separate csv files per country and as separate sheets in the Excel file. Suspect data or entries with a high probability of duplication are marked in the column 'comment'. These validation markers are explained in the file validation_marker.csv. Additionally, the Data Package includes daily time series of cumulative installed capacity per energy source type for Germany, Denmark, Switzerland, the United Kingdom and Sweden. All data processing is conducted in Python and pandas and has been documented in the Jupyter Notebooks linked below.

  16. Youtube Quality Videos Classification

    • kaggle.com
    zip
    Updated Oct 21, 2022
    Cite
    The Devastator (2022). Youtube Quality Videos Classification [Dataset]. https://www.kaggle.com/datasets/thedevastator/youtube-quality-videos-classification/code
    Explore at:
    zip (712986 bytes)
    Available download formats
    Dataset updated
    Oct 21, 2022
    Authors
    The Devastator
    Area covered
    YouTube
    Description

    Youtube Quality Videos Classification

    How to Tell If a Video is Good or Bad

    About this dataset

    This dataset is important as it can help users find good quality videos more easily. The data was collected using the Youtube API and includes a total of _ videos

    Columns: Channel title, view count, like count, comment count, definition, caption, subscribers, total views, average polarity score, label

    How to use the dataset

    In order to use this dataset, you will need to have the following:
    - A YouTube API key
    - A text editor (e.g. Notepad++, Sublime Text, etc.)

    Once you have collected these items, you can begin using the dataset. Here is a step-by-step guide:
    1) Navigate to the folder where you saved the dataset.
    2) Right-click on the file and select Open with > Your text editor.
    3) Copy your YouTube API key and paste it in place of Your_API_Key in line 4 of the code.
    4) Save the file and close your text editor.
    5) Navigate to the folder in your terminal/command prompt and type jupyter notebook. This will open a Jupyter Notebook in your browser window.
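    The collection step itself is not included in the description; as a rough, hypothetical sketch (the endpoint and field names follow the public YouTube Data API v3, and the video ID is a placeholder):

    ```python
    import requests

    API_KEY = "Your_API_Key"    # as referenced in step 3 above
    VIDEO_ID = "VIDEO_ID_HERE"  # placeholder video ID

    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={"part": "snippet,statistics,contentDetails", "id": VIDEO_ID, "key": API_KEY},
        timeout=10,
    )
    item = resp.json()["items"][0]
    print(
        item["snippet"]["channelTitle"],
        item["statistics"].get("viewCount"),
        item["statistics"].get("likeCount"),
    )
    ```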

    Research Ideas

    This dataset can be used for a number of different things including: 1. Finding good quality videos on youtube 2. Determining which videos are more likely to be reputable 3. Helping people find videos they will enjoy

    Acknowledgements

    The data for this dataset was collected using the Youtube API and includes a total of _ videos

    License

    See the dataset description for more information.

    Columns

    File: dataframeclean.csv
    | Column name | Description |
    |:---|:---|
    | channelTitle | |
    | viewCount | |
    | likeCount | |
    | commentCount | |
    | definition | |
    | caption | |
    | subscribers | |
    | totalViews | |
    | avg polarity score | |
    | Label | |
    | pushblishYear | |
    | durationSecs | |
    | tagCount | |
    | title length | |
    | description length | |

    File: ytdataframe.csv
    | Column name | Description |
    |:---|:---|
    | channelTitle | |
    | viewCount | |
    | likeCount | |
    | commentCount | |
    | definition | |
    | caption | |
    | subscribers | |
    | totalViews | |
    | avg polarity score | |
    | Label | |
    | title | The title of the video. (String) |
    | description | A description of the video. (String) |
    | tags | The tags associated with the video. (String) |
    | publishedAt | The date and time the video was published. (String) |
    | favouriteCount | The number of times the video has been favorited. (Integer) |
    | duration | The length of the video in seconds. (Integer) |

    File: ytdataframe2.csv
    | Column name | Description |
    |:---|:---|
    | channelTitle | |
    | title | The title of the video. (String) |
    | description | A description of the video. (String) |
    | tags | The tags associated with the video. (String) |
    | publishedAt | The date and time the video was published. (String) |
    | viewCount | |
    | ... | |

  17. Household Data

    • data.open-power-system-data.org
    csv, sqlite, xlsx
    Updated Apr 15, 2020
    + more versions
    Cite
    Adrian Minde (2020). Household Data [Dataset]. https://data.open-power-system-data.org/household_data/
    Explore at:
    xlsx, csv, sqlite
    Available download formats
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    Open Power System Data
    Authors
    Adrian Minde
    Time period covered
    Dec 11, 2014 - May 1, 2019
    Variables measured
    interpolated, utc_timestamp, cet_cest_timestamp, DE_KN_industrial2_pv, DE_KN_industrial3_ev, DE_KN_residential1_pv, DE_KN_residential3_pv, DE_KN_residential4_ev, DE_KN_residential4_pv, DE_KN_residential6_pv, and 61 more
    Description

    Detailed household load and solar generation in minutely to hourly resolution. This data package contains measured time series data for several small businesses and residential households relevant for household- or low-voltage-level power system modeling. The data includes solar power generation as well as electricity consumption (load) at a resolution down to single-device consumption. The starting point for the time series, as well as data quality, varies between households, with gaps spanning from a few minutes to entire days. All measurement devices provided cumulative energy consumption/generation over time, so overall energy consumption/generation is retained even in the case of data gaps due to communication problems. Measurements were conducted at 1-minute intervals, with all data made available in an interpolated, uniform and regular time interval. All data gaps are either interpolated linearly or filled with data from prior days. Additionally, data in 15- and 60-minute resolution is provided for compatibility with other time series data. Data processing is conducted in Jupyter Notebooks/Python/pandas.
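    A minimal sketch of working with such a file (the filename is a guess at the package's naming and the column is one of the variables listed above; the cumulative counters are turned into per-hour values with a simple diff):

    ```python
    import pandas as pd

    # hypothetical filename; adjust to the actual download
    df = pd.read_csv(
        "household_data_1min_singleindex.csv",
        parse_dates=["utc_timestamp"],
        index_col="utc_timestamp",
    )

    col = "DE_KN_residential1_pv"                     # cumulative energy counter (kWh) for one device
    energy_60min = df[col].resample("60min").last()   # counter value at the end of each hour
    generation_per_hour = energy_60min.diff()         # kWh generated within each hour
    print(generation_per_hour.head())
    ```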

  18. Australian Employee Salary/Wages DATAbase by detailed occupation, location...

    • figshare.com
    txt
    Updated May 31, 2023
    + more versions
    Cite
    Richard Ferrers; Australian Taxation Office (2023). Australian Employee Salary/Wages DATAbase by detailed occupation, location and year (2002-14); (plus Sole Traders) [Dataset]. http://doi.org/10.6084/m9.figshare.4522895.v5
    Explore at:
    txt
    Available download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Australian Taxation Office
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ATO (Australian Tax Office) made a dataset openly available (see links) showing all the Australian Salary and Wages (2002, 2006, 2010, 2014) by detailed occupation (around 1,000) and over 100 SA4 regions. Sole Trader sales and earnings are also provided. This open data (csv) is now packaged into a database (*.sql) with 45 sample SQL queries (backupSQL[date]_public.txt). See more description at the related Figshare #datavis record.

    Versions:
    V5: Following #datascience course, I have made the main data (individual salary and wages) available as csv and Jupyter Notebook. Checksum matches #dataTotals. In 209,xxx rows. Also provided Jobs and SA4 (Locations) description files as csv. More details at: Where are jobs growing/shrinking? Figshare DOI: 4056282 (linked below). Noted 1% discrepancy ($6B) in 2010 wages total - to follow up.

    #dataTotals - Salary and Wages
    Year | Workers (M) | Earnings ($B)
    2002 | 8.5 | 285
    2006 | 9.4 | 372
    2010 | 10.2 | 481
    2014 | 10.3 | 584

    #dataTotal - Sole Traders
    Year | Workers (M) | Sales ($B) | Earnings ($B)
    2002 | 0.9 | 61 | 13
    2006 | 1.0 | 88 | 19
    2010 | 1.1 | 112 | 26
    2014 | 1.1 | 96 | 30

    #links
    See the ATO request for data at the ideascale link below. See the original csv open data set (CC-BY) at the data.gov.au link below. This database was used to create maps of change in regional employment - see the Figshare link below (m9.figshare.4056282).

    #package
    This file package contains a database (analysing the open data) as an SQL dump plus sample SQL text interrogating the DB. DB name: test. There are 20 queries relating to Salary and Wages.

    #analysis
    The database was analysed and outputs provided on Nectar(.org.au) resources at: http://118.138.240.130 (offline). This is only resourced for a maximum of 1 year, from July 2016, so will expire in June 2017; hence the filing here. The sample home page is provided here (and pdf), but not all the supporting files, which may be packaged and added later. Until then all files are available at the Nectar URL. Nectar URL now offline - server files attached as package (html_backup[date].zip), including php scripts, html, csv, jpegs.

    #install
    IMPORT: DB SQL dump e.g. test_2016-12-20.sql (14.8Mb)
    1. Started MAMP on OSX.
    1.1 Go to PhpMyAdmin.
    2. New Database.
    3. Import: Choose file: test_2016-12-20.sql -> Go (about 15-20 seconds on MacBookPro 16Gb, 2.3 GHz i5).
    4. Four tables appeared: jobTitles 3,208 rows | salaryWages 209,697 rows | soleTrader 97,209 rows | stateNames 9 rows; plus views e.g. deltahair, Industrycodes, states.
    5. Run the test query under #sampleSQL: Sum of Salary by SA4, e.g. 101 $4.7B, 102 $6.9B.

    #sampleSQL

        select sa4,
          (select sum(count) from salaryWages where year = '2014' and sa4 = sw.sa4) as thisYr14,
          (select sum(count) from salaryWages where year = '2010' and sa4 = sw.sa4) as thisYr10,
          (select sum(count) from salaryWages where year = '2006' and sa4 = sw.sa4) as thisYr06,
          (select sum(count) from salaryWages where year = '2002' and sa4 = sw.sa4) as thisYr02
        from salaryWages sw
        group by sa4
        order by sa4

  19. Replication package for the paper accepted at Springer's EMSE Journal:...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jul 8, 2020
    Cite
    Robert Heumüller; Robert Heumüller; Sebastian Nielebock; Sebastian Nielebock; Jacob Krüger; Jacob Krüger; Frank Ortmeier; Frank Ortmeier (2020). Replication package for the paper accepted at Springer's EMSE Journal: Publish or Perish - But do not Forget your Software Artifacts [Dataset]. http://doi.org/10.5281/zenodo.3925742
    Explore at:
    zip
    Available download formats
    Dataset updated
    Jul 8, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Robert Heumüller; Robert Heumüller; Sebastian Nielebock; Sebastian Nielebock; Jacob Krüger; Jacob Krüger; Frank Ortmeier; Frank Ortmeier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the replication package for the paper "Publish or Perish - But do not Forget your Software Artifacts", accepted at Springer's EMSE Journal in June 2020.

    It contains:

    • A ReadMe file with instructions on how to use the replication scripts
    • The complete, labeled dataset of 792 ICSE papers as CSV (see the loading sketch after this list)
    • All the Python scripts that we used for data acquisition and preparation
    • The Jupyter notebook that we used for the evaluation, which also contains some additional analyses not included in the paper
    • An HTML export of the notebook for quick reference
    • A folder containing all of the diagrams in PDF form
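    As a quick orientation before opening the notebook, the labeled CSV can be inspected with pandas. This is a minimal sketch under assumptions: the file name (icse_papers_labeled.csv) and the label column name (label) are hypothetical; the package ReadMe gives the actual names.

    # Minimal sketch (assumptions): load and inspect the labeled dataset of ICSE papers.
    # File and column names are hypothetical -- see the package ReadMe for the real ones.
    import pandas as pd

    papers = pd.read_csv("icse_papers_labeled.csv")
    print(len(papers))                     # expect 792 rows, one per ICSE paper
    print(papers.columns.tolist())         # list the available fields
    print(papers["label"].value_counts())  # distribution of the manual labels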
  20. Speedtest Open Data - Australia 2020 Q2, Q3, Q4 extract

    • figshare.com
    txt
    Updated Oct 24, 2025
    + more versions
    Richard Ferrers; Speedtest Global Index (2025). Speedtest Open Data - Australia 2020 Q2, Q3, Q4 extract [Dataset]. http://doi.org/10.6084/m9.figshare.13370504.v17
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 24, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Speedtest Global Index
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Australia
    Description

    This is an Australian extract of Speedtest Open data available at Amazon WS (link below - opendata.aws). The AWS data licence is "CC BY-NC-SA 4.0", so use of this data must be: - non-commercial (NC) - reuse must be share-alike (SA) (add the same licence). This restricts the standard CC-BY Figshare licence.

    The world speedtest open data was downloaded (>400Mb, 7M lines of data). An extract of Australian locations (lat, long) revealed 88,000 lines of data (attached as csv). A Jupyter notebook of the extract process is attached. A link to a Twitter thread of outputs is provided. A link to a data tutorial is provided (GitHub), including a Jupyter Notebook to analyse World Speedtest data, selecting one US State.

    Data shows (Q2): - 3.1M speedtests - 762,000 devices - 88,000 grid locations (600m * 600m), summarised as a point - average speed 33.7Mbps (down), 12.4Mbps (up) - max speed 724Mbps - data is for 600m * 600m grids, showing average speed up/down, number of tests, and number of users (IP). Added centroid, and now lat/long. See tweet of image of centroids, also attached.

    Versions: v15/16 Add Hist comparing Q1-21 vs Q2-20. Inc ipynb (incHistQ121, v.1.3-Q121) to calc. v14 Add AUS Speedtest Q1 2021 geojson (79k lines, avg d/l 45.4Mbps). v13 Added three-colour MELB map (less than 20Mbps, over 90Mbps, 20-90Mbps). v12 Added AUS - Syd - Mel line chart Q320. v11 Add line chart comparing Q2, Q3, Q4 plus Melb - results virtually indistinguishable. Add line chart to compare Syd - Melb Q3; also virtually indistinguishable. Add Hist comparing Syd - Melb Q3. Add new Jupyter notebook with graph calcs (nbn-AUS-v1.3). Some errata documented in the notebook: issue with resorting the table and graphing only part of the table; not an issue if all lines of the table are graphed. v10 Load AURIN sample pics. Speedtest data loaded to the AURIN geo-analytic platform; requires edu.au login. v9 Add comparative Q2, Q3, Q4 Hist pic. v8 Added Q4 data geojson. Add Q3, Q4 Hist pic. v7 Rename to include Q2, Q3 in title. v6 Add Q3 20 data. Rename geojson AUS data as Q2. Add comparative histogram. Calc in International.ipynb. v5 Add Jupyter Notebook inc histograms. Hist is a count of geo-locations' average download speed (unweighted by tests). v4 Added Melb choropleth (png 50Mpix) inc legend. (To do - add Melb.geojson.) Posted link to AURIN description of Speedtest data. v3 Add superfast data (>100Mbps), less than 1% of data - 697 lines. Includes png of superfast.plot(). Link below to Google Maps version of superfast data points. Also Google map of first 100 data points - sample data. Geojson format for loading into GeoPandas, per Jupyter Notebook. New version of Jupyter Notebook, v.1.1. v2 Add centroids image. v1 Initial data load.

    ** Future Work - combine Speedtest data with NBN Technology by location data (nationalmap.gov.au); https://www.data.gov.au/dataset/national-broadband-network-connections-by-technology-type - combine Speedtest data with SEIFA data (socioeconomic categories) - to discuss with AURIN - further international comparisons - discussed collaboration with Assoc Prof Tooran Alizadeh, USyd.
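    The extract process described above can be reproduced with GeoPandas, in the same spirit as the attached notebook. A minimal sketch, assuming the world fixed-broadband tiles have already been downloaded from opendata.aws and unpacked locally; the shapefile name (gps_fixed_tiles.shp), the column name (avg_d_kbps) and the exact bounding box are assumptions to check against the actual data.

    # Minimal sketch (assumptions): clip the world Speedtest tiles to a rough
    # Australian bounding box, as in the attached extract notebook.
    import geopandas as gpd

    tiles = gpd.read_file("gps_fixed_tiles.shp")   # assumed local file name
    aus = tiles.cx[112:154, -44:-10]               # approx. lon/lat box for Australia
    print(len(aus))                                # ~88,000 tiles in the Q2 2020 extract
    print(aus["avg_d_kbps"].mean() / 1000)         # average download speed in Mbps (assumed column)
    aus.to_file("aus_tiles.geojson", driver="GeoJSON")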


