Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599
Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599
The folder named “submission” contains the following:

ijgis.yml: This file lists all the Python libraries and dependencies required to run the code. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.

The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below:

- … a .png file for each column of the raw gaze and IMU recordings, color-coded with logged events, … .csv files.
- overlapping_sliding_window_loop.py: The function plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can uncomment this line. … .csv files in the results folder.
- This part contains three main code blocks, including (iii) one for the XGBoost code with correct hyperparameter tuning. Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2, Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
- … a .csv file containing inferred labels.

The data is licensed under CC-BY; the code is licensed under MIT.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).
This is a dataset of London (UK) job posts extracted from reed.co.uk. These jobs contain the keyword 'data'. I extracted the data using Python, writing a loop that paged through over 400 result pages and scraped over 10,000 job posts.
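As a rough illustration of that kind of pagination loop (this is not the author's original script; the URL pattern, query parameter, and CSS selector are hypothetical):

```python
import time
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.reed.co.uk/jobs/data-jobs-in-london"  # hypothetical URL pattern

posts = []
for page in range(1, 401):  # loop over ~400 result pages
    resp = requests.get(BASE_URL, params={"pageno": page}, timeout=30)  # hypothetical parameter name
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Hypothetical selector for a job card; inspect the real page to find the right one
    for card in soup.select("article.job-result-card"):
        title = card.find("h2")
        posts.append({"title": title.get_text(strip=True) if title else None})
    time.sleep(1)  # be polite to the server

print(f"Scraped {len(posts)} job posts")
```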
CodeParrot 🦜 Dataset Cleaned
What is it?
A dataset of Python files from GitHub. This is the deduplicated version of the CodeParrot dataset.
Processing
The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
Deduplication: remove exact matches.
Filtering: average line length < 100; maximum line length < 1000; alphanumeric character fraction > 0.25; remove auto-generated files (keyword search). A rough sketch of these filters is shown below.
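A minimal sketch of what these filtering heuristics might look like for a single file, assuming the file content is held in a string (thresholds taken from the list above; the auto-generation keywords are illustrative, not the dataset's actual keyword list):

```python
AUTO_GENERATED_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")  # illustrative

def keep_file(content: str) -> bool:
    """Return True if a Python file passes the cleaning filters described above."""
    lines = content.splitlines()
    if not lines:
        return False
    avg_line_len = sum(len(l) for l in lines) / len(lines)
    max_line_len = max(len(l) for l in lines)
    alnum_fraction = sum(c.isalnum() for c in content) / max(len(content), 1)
    if avg_line_len >= 100 or max_line_len >= 1000:
        return False
    if alnum_fraction <= 0.25:
        return False
    head = "\n".join(lines[:10]).lower()  # keyword search near the top of the file
    return not any(kw in head for kw in AUTO_GENERATED_KEYWORDS)
```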
For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Python scripts and functions needed to view and clean saccade data.
The dataset is gathered on Sep. 17th 2020 from GitHub. It has more than 5.2K Python repositories and 4.2M type annotations. The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
The EPA GitHub repository PAU4Chem, as described in the README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flow transfers, estimating releases, and identifying potential occupational exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The Chemicals_in_categories.csv file contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study, as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and for tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset was built using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to gather, at scale, information about exposure limits and physical properties from different publicly available sources: EPA, NOAA, OSHA, and the Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). All GitHub repositories also describe the Python libraries required for running their code, how to use them, the output files obtained after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.

The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.

The full set of files, in order of use, is as follows:

Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
HTML.zip: Archived version of the set of individual HTML files.
03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
04 GPT application.py: Python script using OpenAI's API to classify selected pages according to their HTML title and URL.
categorization_applied.csv: Output file containing the classification of selected pages.
exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.
metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
TM_125.RData: RData file containing the results of the 125-topic model.
loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.
Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms "sustainability" and "profitability" over time.
99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.

For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.

The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found here: https://haans-mertens.github.io/ and in the following article: "The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data", by Richard F.J. Haans and Marc J. Mertens in Organizational Research Methods. The full paper can be accessed here.
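For readers curious how such training and alignment can work in principle, here is a minimal sketch (not the repository's actual script) using gensim's Word2Vec plus an orthogonal-Procrustes rotation over the shared vocabulary; the hyperparameters and workflow are assumptions:

```python
import numpy as np
from gensim.models import Word2Vec

def train_period_model(sentences):
    """Train a Word2Vec model on tokenized sentences for one time period (illustrative settings)."""
    return Word2Vec(sentences, vector_size=300, window=5, min_count=10, workers=4, epochs=5)

def align_to_base(base, other):
    """Rotate `other`'s embedding space onto `base`'s using orthogonal Procrustes on the shared vocabulary."""
    shared = [w for w in base.wv.index_to_key if w in other.wv.key_to_index]
    B = np.vstack([base.wv[w] for w in shared])
    O = np.vstack([other.wv[w] for w in shared])
    U, _, Vt = np.linalg.svd(O.T @ B)
    R = U @ Vt                      # orthogonal rotation mapping O onto B
    other.wv.vectors = other.wv.vectors @ R
    return other

# Usage sketch: train one model per time period, align all later models to the earliest one,
# so that e.g. model.wv["sustainability"] becomes comparable across periods.
```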
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
© 2020, Bastian Bechtold. All rights reserved.
Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, for six speech corpora and two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.
The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus’ ground truth, the algorithms’ own clean-speech estimates, and our own consensus truth. It can thus serve as the basis for a comparison study, to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available to download, and are entirely reproducible, albeit requiring about one year of processor time.
Included Code and Data
ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.
noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:
noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.
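As a small, hedged example of how the pickled dataframes could be inspected (file names taken from the list above; the columns printed will be whatever the dataframes actually contain):

```python
import pandas as pd

# Load the pre-computed performance metrics
noisy = pd.read_pickle("noisy_speech.pkl")
synthetic = pd.read_pickle("synthetic_speech.pkl")

# Inspect the available metrics and index structure before aggregating or plotting
print(noisy.columns.tolist())
print(noisy.index.names)
print(noisy.head())
```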
References:
Data set consisting of data joined for analyzing the SBIR/STTR program. Data consists of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.

Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.
Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.
Allard_Analysis_APPAM SBIR project: Forthcoming.
Allard_Spatial Analysis: Forthcoming.
Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983-2018 and accounting for all eleven SBIR/STTR agencies. This data set consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.
Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget-performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the websites of awards.
Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API.

Primary Sources:
Small Business Administration. "Annual Reports Dashboard," 2018. https://www.sbir.gov/awards/annual-reports.
Small Business Administration. "SBIR Awards Data," 2018. https://www.sbir.gov/api.
Small Business Administration. "SBIR Solicit Data," 2018. https://www.sbir.gov/api.
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid:
Continental Europe
Great Britain
Nordic
This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows it to be re-published upon request [3].
Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].
Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].
Content of the repository
A) Scripts
In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSOs' websites.
In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).
In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).
The Python scripts run with Python 3.7 and with the packages found in "requirements.txt".
B) Yearly converted and cleansed data
The folders "_converted" contain the output of "convert_data_format.py" and the folders "_cleansed" contain the output of "clean_corrupted_data.py".
File type: The files are zipped csv-files, where each file comprises one year.
Data format: The files contain two columns. The second column contains the frequency values in Hz. The first one represents the time stamps in the format Year-Month-Day Hour-Minute-Second, which is given as naive local time. The local time refers to the following time zones and includes Daylight Saving Times (python time zone in brackets):
TransnetBW: Continental European Time (CET)
Nationalgrid: Great Britain (GB)
Fingrid: Finland (Europe/Helsinki)
NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.
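A minimal sketch of how one of the yearly cleansed files could be loaded with pandas (the file name, header handling, and column label are assumptions; adjust them to the actual archive contents):

```python
import pandas as pd

# Hypothetical file name; each zipped csv holds one year of data
df = pd.read_csv("Continental_Europe_2019.csv.zip", index_col=0, parse_dates=True)
df.columns = ["frequency_hz"]  # second column: frequency values in Hz

# Corrupted and missing recordings are marked as NaN
print("fraction of NaN values:", df["frequency_hz"].isna().mean())

# Timestamps are naive local time; Continental Europe uses CET/CEST with DST
df.index = df.index.tz_localize("CET", ambiguous="NaT", nonexistent="NaT")
```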
Use cases
We point out that this repository can be used in two different ways:
Use pre-processed data: You can directly use the converted or the cleansed data. Note, however, that both data sets include segments of NaN values due to missing and corrupted recordings. Only a very small part of the NaN values was eliminated in the cleansed data, so as not to manipulate the data too much.
Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "_converted".
License
This work is licensed under multiple licenses, which are located in the "LICENSES" folder.
We release the code in the folder "Scripts" under the MIT license .
The pre-processed data in the subfolders "**/Fingrid" and "**/Nationalgrid" are licensed under CC-BY 4.0.
TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
Changelog
Version 2:
Add time zone information to description
Include new frequency data
Update references
Change folder structure to yearly folders
Version 3:
Correct TransnetBW files for missing data in May 2016
The purpose of this data release is to provide data in support of the Bureau of Land Management's (BLM) Reasonably Foreseeable Development (RFD) Scenario by estimating water-use associated with oil and gas extraction methods within the BLM Carlsbad Field Office (CFO) planning area, located in Eddy and Lea Counties as well as part of Chaves County, New Mexico. Three comma separated value files and two python scripts are included in this data release. It was determined that all reported oil and gas wells within Chaves County from the FracFocus and New Mexico Oil Conservation Division (NM OCD) databases were outside of the CFO administration area and were excluded from well_records.csv and modeled_estimates.csv. Data from Chaves County are included in the produced_water.csv file to be consistent with the BLM’s water support document. Data were synthesized into comma separated values which include, produced_water.csv (volume) from NM OCD, well_records.csv (including location and completion) from NM OCD and FracFocus, and modeled_estimates.csv (using FracFocus as well as Ball and others (2020) as input data). The results from modeled_estimates.csv were obtained using a previously published regression model (McShane and McDowell, 2021) to estimate water use associated with unconventional oil and gas activities in the Permian Basin (Valder and others, 2021) for the period of interest (2010-2021). Additionally, python scripts to process, clean, and categorize FracFocus data are provided in this data release.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GovReport Summarization - 8192 tokens
ccdv/govreport-summarization with the following changes: the data was cleaned with the clean-text Python package, and the total number of tokens for each column was computed and added in new columns according to the long-t5 tokenizer (done after cleaning).
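A hedged sketch of what that preprocessing could look like (this is not the dataset author's script; the clean-text options, the long-t5 checkpoint, and the column names are assumptions):

```python
from cleantext import clean
from transformers import AutoTokenizer

# Assumed checkpoint; any long-t5 tokenizer would give comparable counts
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")

def preprocess(example):
    # Clean both text columns first, then count tokens on the cleaned text
    example["report"] = clean(example["report"], lower=False)      # assumed column name
    example["summary"] = clean(example["summary"], lower=False)    # assumed column name
    example["report_tokens"] = len(tokenizer(example["report"]).input_ids)
    example["summary_tokens"] = len(tokenizer(example["summary"]).input_ids)
    return example

# e.g. dataset = dataset.map(preprocess) on a Hugging Face datasets object
```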
Train split info (pandas DataFrame summary): RangeIndex: 8200 entries, 0 to 8199; Data columns (total 4 columns), including report (8200 non-null)… See the full description on the dataset page: https://huggingface.co/datasets/pszemraj/govreport-summarization-8192.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is gathered on Sep. 17th 2020 from GitHub. It has clean and complete versions (from v0.7): The clean version has 5.1K type-checked Python repositories and 1.2M type annotations. The complete version has 5.2K Python repositories and 3.3M type annotations. The dataset's source files are type-checked using mypy (clean version). The dataset is also de-duplicated using the CD4Py tool. Check out the README.MD file for the description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available on its GitHub repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To gather the dataset, we asked two participants to perform six basic knife activities. The layout of the experimental setup is provided in Fig. 4. As it illustrates, we placed the receiver on the right side and the ESP32 transceiver on the left side of the performing area. The performing area is a cutting board (30 x 46 cm) in this experiment. Each participant performed each activity five times in the performing area. The data was recorded using a customized version of the ESP32-CSI-tool [38] on a laptop, which let us record and save each recording in a separate file. After recording all 60 data entries, we used Python code to extract the clean data from the text files generated by the tool. The clean data is stored in a database, which constitutes the dataset.
Three Cases: Metadata and Procedures. The data sets described here were used in an article submitted to the journal GeoHealth in 2021. The data files and further supplemental links (including general information about GLOBE data) can be accessed at https://observer.globe.gov/get-data/mosquito-habitat-data.

Case 1: Removal of records with suspect geolocation data. A Python script was applied to remove records where the measured position (in decimal degrees) was identical to the GLOBE MGRS site position. GPS-obtained latitude and longitude coordinates are reported in decimal degrees, so records identified by whole numbers were also removed. This procedure removed 5,704 (23%) of the 24,983 records in the Mosquito Habitat Mapper database, with 19,279 records remaining. The secondary data sets cleaned only for geolocation anomalies were labeled Case 1. A rough sketch of this kind of filter is shown below.

Case 2: Identifying suspected training events. For this test, we sought to identify groups of data that exceeded 10 records sharing these characteristics. Another Python script was employed to extract the photos for ease of visual inspection. Because we needed to manually review the photo records, we set the threshold for groups at >10, so that the analysis could be completed in the time allotted. Groups identified through this procedure were outputted as Case 2 groups. The resulting data set cleaned of groups >10 was labeled Case 2. The resulting data set included 20,006 records and identified 2,447 records found in clusters we postulated were training events.

Case 3: The Case 3 secondary dataset results from applying the Python scripts used to create Cases 1 and 2. We used the Case 3 data sets, with improved geolocation and large groups eliminated, in the following analysis.

The information in this description was last updated 2021-04-12.
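A minimal sketch of the Case 1 geolocation filter described above, assuming a pandas DataFrame with hypothetical column names for the measured and site coordinates:

```python
import pandas as pd

def case1_filter(df: pd.DataFrame) -> pd.DataFrame:
    """Drop records with suspect geolocation data (see Case 1 above).

    Assumed columns: measured_lat/measured_lon (GPS position in decimal degrees)
    and site_lat/site_lon (GLOBE MGRS site position).
    """
    # Records whose measured position is identical to the MGRS site position
    same_as_site = (df["measured_lat"] == df["site_lat"]) & (df["measured_lon"] == df["site_lon"])
    # Records reported with whole-number coordinates
    whole_numbers = (df["measured_lat"] % 1 == 0) & (df["measured_lon"] % 1 == 0)
    return df[~(same_as_site | whole_numbers)].copy()
```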
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
The repository contains data on party strength for each state, as shown on each state's corresponding party strength Wikipedia page (for example, Virginia).
Each state has a table of a detailed summary of the state of its governing and representing bodies on Wikipedia but there is no data set that collates these entries. I scraped each state's Wikipedia table and collated the entries into a single dataset. The data are stored in the state_party_strength.csv and state_party_strength_cleaned.csv. The code that generated the file can be found in corresponding Python notebooks.
The data contain information from 1980 on each state's:
1. governor and party
2. state house and senate composition
3. state representative composition in Congress
4. electoral votes

Data in the clean version has been cleaned and processed substantially. Namely:
- all columns now contain homogeneous data within the column
- names and Wiki-citations have been removed
- only the party counts and party identification have been left
The notebook that created this file is here.

The data contained herein have not been altered from their Wikipedia tables except in two instances:
- Forced column names to be in accord across states
- Any needed data modifications (i.e. concatenated string columns) to retain information when combining columns
Please note that the right encoding for the dataset is "ISO-8859-1", not 'utf-8', though in future versions I will try to fix that to make it more accessible.
This means that you will likely have to perform further data wrangling prior to doing any substantive analysis. The notebook that has been used to create this data file is located here.
The raw scraped data can be found in the pickle. This file contains a Python dictionary where each key is a US state name and each element is the raw scraped table in Pandas DataFrame format.
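For example, loading the cleaned CSV and the raw pickle might look like this (the CSV name comes from the description above; the pickle file name is hypothetical):

```python
import pandas as pd
import pickle

# The cleaned table; note the non-UTF-8 encoding mentioned above
clean = pd.read_csv("state_party_strength_cleaned.csv", encoding="ISO-8859-1")

# The raw scrape: a dict mapping US state name -> pandas DataFrame
with open("state_party_strength.pkl", "rb") as f:   # hypothetical pickle file name
    raw_tables = pickle.load(f)

print(clean.head())
print(raw_tables["Virginia"].head())
```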
I hope it proves useful to you in analyzing political patterns at the state level in the US for political and policy research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the experimental data collected from thermocouples positioned inside a dual cavity solar receiver, used to demonstrate and evaluate the melting of zinc metal using only concentrated solar power as heat input. More information pertaining to the thermocouple positioning and the receiver design can be found in the thesis titled "Testing and development of a solar-dish cavity receiver for the melting of zinc metal". Five experiments were conducted, each with a unique set of environmental conditions:
Experiment 1 – 26th of July 2022 = “Exp 1_26072022”
Experiment 2 – 4th of August 2022 = “Exp 2_04082022”
Experiment 3 – 16th of August 2022 = “Exp 3_16082022”
Experiment 4 – 21st of August 2022 = “Exp 4_21082022”
Experiment 5 – 5th of September 2022 = “Exp 5_05092022”
Also included in the dataset are the original weather data collected on the respective experimental test days, as well as the weather data in processed form, corrected to serve as input for the numerical model developed in the Python coding language. Raw weather data:
Exp 1_Weather data_Original_26072022
Exp 2_Weather data_Original_04082022
Exp 3_Weather data_Original_16082022
Exp 4_Weather data_Original_21082022
Exp 5_Weather data_Original_05092022
Processed weather data:
Exp 1_Weather data_Post-process_26072022
Exp 2_Weather data_Post-process_04082022
Exp 3_Weather data_Post-process_16082022
Exp 4_Weather data_Post-process_21082022
Exp 5_Weather data_Post-process_05092022
In addition to all the weather data and the experimental results collected on the five experimental runs, the dataset also contains the Python code used to predict the zinc temperature in the cavity receiver. The code was compiled in Jupyter Notebook and the files consist of the heat loss calculations and zinc temperature prediction for each experimental run. The code has been validated against the experimental data and has been demonstrated to have a mean absolute percentage error (MAPE) of 2.7%. The code can thus be used to predict the zinc temperature inside a cavity receiver to within 2.7% accuracy, by making use of actual weather data as input. Python code for each experiment, with the heat transfer factor validated using the experimental data mentioned above (a short sketch of the MAPE metric follows the list below):
Experiment 1.ipynb
Experiment 2.ipynb
Experiment 3.ipynb
Experiment 4.ipynb
Experiment 5.ipynb
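As a small illustration of the reported error metric (this is not code from the notebooks; the temperature arrays are hypothetical), MAPE can be computed as follows:

```python
import numpy as np

def mape(measured: np.ndarray, predicted: np.ndarray) -> float:
    """Mean absolute percentage error, in percent."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((measured - predicted) / measured)) * 100)

# Hypothetical zinc temperatures in degrees Celsius
measured_T = np.array([420.0, 435.0, 450.0, 462.0])
predicted_T = np.array([428.0, 430.0, 455.0, 470.0])
print(f"MAPE = {mape(measured_T, predicted_T):.1f}%")
```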
import pandas as pd

# Example dataset with new columns
data = [
    {
        "title": "Pandas Library",
        "about": "Pandas is a Python library for data manipulation and analysis.",
        "procedure": "Install Pandas via pip, load data into DataFrames, clean and analyze data using built-in functions.",
        "content": """
            Pandas provides data structures like Series and DataFrame for handling structured data.
            It supports indexing, slicing, aggregation, joining, and filtering…

See the full description on the dataset page: https://huggingface.co/datasets/vicky3241/rag.
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
The goal of this project was to extract data from an NBA stats website using web scraping techniques and then perform data analysis to create visualizations using Python. The website used was "https://www.basketball-reference.com/", which contains data on players and teams in the NBA. The code for this project can be found on my GitHub repository at "https://github.com/Duggsdaddy/Srihith_I310D.git".
The data was extracted using the BeautifulSoup library in Python and stored in a Pandas DataFrame. It was then cleaned and processed to remove unnecessary columns and rows, and the data types of the columns were checked and corrected where necessary.
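A hedged sketch of that extraction step (not the project's exact code; the page URL and table id are assumptions based on how basketball-reference pages are typically laid out):

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://www.basketball-reference.com/leagues/NBA_2023_per_game.html"  # assumed page

resp = requests.get(URL, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
table = soup.find("table", {"id": "per_game_stats"})  # assumed table id

# Parse the HTML table into a DataFrame, then drop repeated header rows
df = pd.read_html(StringIO(str(table)))[0]
df = df[df["Player"] != "Player"].reset_index(drop=True)

# Coerce numeric columns so later analysis and plotting work correctly
for col in df.columns.drop(["Player", "Pos", "Tm"]):
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.head())
```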
The data was analyzed using various Python libraries such as Matplotlib, Seaborn, and Plotly to create visualizations like bar graphs, line graphs, and box plots. The visualizations were used to identify trends and patterns in the data.
The project follows ethical web scraping practices by not overwhelming the website with too many requests and by giving proper attribution to the website as the source of the data.
Overall, this project demonstrates how web scraping and data analysis techniques can be used to extract meaningful insights from data available on the internet.
Here's a data dictionary for the table:
Player: string - name of the player
Pos (Position): string - position played by the player
Age: integer - age of the player as of February 1, 2023
Tm (Team): string - team the player belongs to
G (Games Played): integer - number of games played by the player
GS (Games Started): integer - number of games started by the player
MP (Minutes Played): integer - total minutes played by the player
FG (Field Goals): integer - number of field goals made by the player
FGA (Field Goal Attempts): integer - number of field goal attempts by the player
FG% (Field Goal Percentage): float - percentage of field goals made by the player
3P (3-Point Field Goals): integer - number of 3-point field goals made by the player
3PA (3-Point Field Goal Attempts): integer - number of 3-point field goal attempts by the player
3P% (3-Point Field Goal Percentage): float - percentage of 3-point field goals made by the player
2P (2-Point Field Goals): integer - number of 2-point field goals made by the player
2PA (2-Point Field Goal Attempts): integer - number of 2-point field goal attempts by the player
2P% (2-Point Field Goal Percentage): float - percentage of 2-point field goals made by the player
eFG% (Effective Field Goal Percentage): float - effective field goal percentage of the player
FT (Free Throws): integer - number of free throws made by the player
FTA (Free Throw Attempts): integer - number of free throw attempts by the player
FT% (Free Throw Percentage): float - percentage of free throws made by the player
ORB (Offensive Rebounds): integer - number of offensive rebounds by the player
DRB (Defensive Rebounds): integer - number of defensive rebounds by the player
TRB (Total Rebounds): integer - total rebounds by the player
AST (Assists): integer - number of assists made by the player
STL (Steals): integer - number of steals made by the player
BLK (Blocks): integer - number of blocks made by the player
TOV (Turnovers): integer - number of turnovers made by the player
PF (Personal Fouls): integer - number of personal fouls made by the player
PTS (Points): integer - total points scored by the player