CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Zip file containing the dataset and R code used to perform the analyses described in “Aridity preferences alter the relative importance of abiotic and biotic drivers on plant species abundance in global drylands”.
A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems allow people to borrow a bike from a computer-controlled "dock": the user enters payment information, and the system unlocks a bike, which can then be returned to another dock belonging to the same system.
BoomBikes, a US bike-sharing provider, has recently suffered a considerable dip in revenue due to the corona pandemic. The company is finding it very difficult to sustain itself in the current market scenario, so it has decided to come up with a mindful business plan to accelerate its revenue.
In this attempt, BoomBikes aspires to understand the demand for shared bikes among people. It plans to prepare itself to cater to people's needs once the situation improves, stand out from other service providers, and make huge profits.
The company has contracted a consulting firm to understand the factors on which the demand for these shared bikes depends. Specifically, it wants to know which factors affect the demand for shared bikes in the American market.
Based on various meteorological surveys and people's lifestyles, the service provider firm has gathered a large dataset on daily bike demands across the American market.
You are required to model the demand for shared bikes with the available independent variables. The model will be used by management to understand exactly how demand varies with different features, so that they can adjust the business strategy to meet demand levels and customers' expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.
In the dataset provided, you will notice that there are three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who have made a rental. The variable 'registered', on the other hand, shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built taking this 'cnt' as the target variable.
When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set.
```python
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
```
- where y_test is the test set of the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set.
- Please perform this step, as the R-squared score on the test set serves as the benchmark for your model.
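For reference, r2_score computes the coefficient of determination, R² = 1 − SS_res/SS_tot. A minimal pure-Python sketch of the same computation, using hypothetical demand values (not from the dataset):

```python
def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, the coefficient of determination
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical demand values, for illustration only
y_test = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]
print(round(r_squared(y_test, y_pred), 4))  # 0.968
```

sklearn's r2_score implements this same formula, so the two should agree on the test set.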
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Significant shifts in latitudinal optima of North American birds (PNAS)
Paulo Mateus Martins, Marti J. Anderson, Winston L. Sweatman, and Andrew J. Punnett
Overview: This file contains the raw 2022 release of the North American Breeding Bird Survey dataset (Ziolkowski Jr et al. 2022), the filtered version used in our paper, and the code that generated it. We also include code that uses BirdLife's species distribution shapefiles to classify species as eastern or western based on their occurrence in the BBS dataset and to calculate the percentage of their range covered by the BBS sampling extent. Note that this code requires species distribution shapefiles, which are not provided but can be obtained directly from https://datazone.birdlife.org/species/requestdis.
Reference: D. J. Ziolkowski Jr., M. Lutmerding, V. I. Aponte, M. A. R. Hudson, North American breeding bird survey dataset 1966–2021: U.S. Geological Survey data release (2022), https://doi.org/10.5066/P97WAZE5
Detailed file description:
- info_birds_names_shp: A data frame that links BBS species names (column Species) to shapefiles (column Species_BL). See the code2_sampling coverage script.
- dat_raw_BBS_data_v2022: This R environment contains the raw BBS data from the 2022 release (https://www.sciencebase.gov/catalog/item/625f151ed34e85fa62b7f926). It holds data frames created from the files "Routes.zip" (route information), "SpeciesList.txt" (bird taxonomy), and "50-StopData.zip" (actual counts per route and year). This object is the starting point for creating the dataset used in the paper, which was filtered to remove taxonomic uncertainties, as demonstrated in the code1_build_long_wide_datasets R script.
- code1_build_long_wide_datasets: Filters the original dataset (dat_raw_BBS_data_v2022) to remove taxonomic uncertainties, assigns routes as either eastern or western based on regionalization using the dynamically constrained agglomerative clustering and partitioning method (see the Methods section of the paper), and generates the full long and wide versions of the dataset used in the analyses (dat2_filtered_data_long, dat3_filtered_data_wide).
- dat2_filtered_data_long: The filtered raw dataset in long form. For the analyses, this dataset was further filtered to remove nocturnal and aquatic species, as well as species with fewer than 30 occurrences, but the complete version is available here. To obtain the exact subset used in the analysis, filter this dataset using the column Species from datasets S1 or S3.
- dat3_filtered_data_wide: The filtered raw dataset in wide form, further filtered for the analyses in the same way as dat2_filtered_data_long; the same Species-based subsetting applies.
- code2_sampling coverage: Determines how much of a bird distribution is covered by the BBS sampling extent (refer to Dataset S1). This script requires bird species distribution shapefiles from BirdLife International, which we are not permitted to share; they can be requested directly at https://datazone.birdlife.org/species/requestdis
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.
Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.
Potential methods for addressing these limitations:
- Identifying and removing unrealistic temperature estimates: calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
- Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
- Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles: these waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.
File descriptions:
- "year_byscene=XXXX.zip" - includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible causes. An example file path for this dataset: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" - includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" - includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" - a crosswalk file that identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" - a crosswalk file that identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" - a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
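The screening rules above can be sketched in plain Python (the column names wb_dswe9_pixels, wb_dswe1_pixels, and dp_dswe come from this release; the example rows and the 20% cloud threshold are hypothetical, and in practice one would query the .parquet files, e.g. with the provided R script):

```python
def percent_cloud_pixels(row):
    # percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)
    total = row["wb_dswe9_pixels"] + row["wb_dswe1_pixels"]
    return row["wb_dswe9_pixels"] / total if total else 1.0

def keep_row(row, max_cloud=0.20):
    # Apply the three screening rules described above:
    # limited cloud cover, at least 10 water pixels, deepest point is water.
    return (percent_cloud_pixels(row) <= max_cloud
            and row["wb_dswe1_pixels"] >= 10
            and row["dp_dswe"] == 1)

# Hypothetical rows, for illustration only
rows = [
    {"wb_dswe9_pixels": 5, "wb_dswe1_pixels": 95, "dp_dswe": 1},   # kept
    {"wb_dswe9_pixels": 60, "wb_dswe1_pixels": 40, "dp_dswe": 1},  # too cloudy
    {"wb_dswe9_pixels": 0, "wb_dswe1_pixels": 5, "dp_dswe": 1},    # too few water pixels
]
filtered = [r for r in rows if keep_row(r)]
print(len(filtered))  # 1
```

The same predicate could be expressed as a filter on an arrow dataset; the threshold should be chosen per application.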
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Microplastics dataset
---
This repository includes the data and code used in Montoyal et al (ms number). More specifically:
- Dataset.xlsx includes data on microplastics (MPs), bacteria and phytoplankton, environmental variables, and ocean productivity
- ‘Microplastics code’ refers to the programming code used for the statistical analyses and plot construction using R software.
## Description of the data and file structure
The data is provided with variables as columns. The abbreviations are explained below:
id = sampling mesocosm and time
day = sampling day
temp = temperature (degrees Celsius)
depth = ocean depth (meters)
mesocosm = mesocosm ID
plast_tot = total concentration of microplastics (g cm-3)
plast_ps = concentration of polystyrene (g cm-3)
plast_pp = concentration of polypropylene (g cm-3)
plast_pet = concentration of polyethylene terephthalate (g cm-3)
plast_pvc = concentration of polyvinyl chloride (g cm-3)
plast_pe = concentration of polyethylene (g cm-3)
ammonium = concentration of ammonium (NH4+) (mg m-3)
hna = high nucleic acid concentration bacteria (% of total bacteria)
lna = low nucleic acid concentration bacteria (% of total bacteria)
chla_fluo = phytoplankton biomass, measured as chlorophyll a concentration (mg m-3)
fvfm = photosynthetic efficiency, measured as the ratio between variable and maximum fluorescence (Fv/Fm)
## Sharing/Access information
Not relevant
## Code/Software
‘Microplastics code’ refers to the programming code used for the statistical analyses and plot construction using R software.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This directory contains the cross-position activity recognition datasets used in the following paper. Please consider citing this article if you use the datasets.
Jindong Wang, Yiqiang Chen, Lisha Hu, Xiaohui Peng, and Philip S. Yu. Stratified Transfer Learning for Cross-domain Activity Recognition. 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom).
These datasets were constructed from three public datasets: OPPORTUNITY (opp) [1], PAMAP2 (pamap2) [2], and UCI DSADS (dsads) [3].
Here is some useful information about this directory. Please feel free to contact jindongwang@outlook.com for more information.
This is NOT the raw data: feature extraction has already been performed and the features normalized into [-1,1]. The code for feature extraction can be found here: https://github.com/jindongwang/activityrecognition/tree/master/code. Currently, there are 27 features for a single sensor and 81 features for a body part. More information can be found in the above PerCom-18 paper.
There are 4 .mat files corresponding to the datasets: dsads.mat for UCI DSADS, opp_hl.mat and opp_loco.mat for OPPORTUNITY, and pamap.mat for PAMAP2. Note that opp_hl and opp_loco denote 'high-level' and 'locomotion' activities, respectively.
(1) dsads.mat: 9120 * 408. Columns 1~405 are features, listed in the order of 'Torso', 'Right Arm', 'Left Arm', 'Right Leg', and 'Left Leg'; each position contains 81 columns of features. Columns 406~408 are labels: column 406 is the activity sequence indicating the order in which activities were executed (usually not used in experiments), column 407 is the activity label (1~19), and column 408 denotes the person (1~8).
(2) opp_hl.mat and opp_loco.mat: Same structure as dsads.mat, but with more body parts: 'Back', 'Right Upper Arm', 'Right Lower Arm', 'Left Upper Arm', 'Left Lower Arm', 'Right Shoe (Foot)', and 'Left Shoe (Foot)'. (We did not use the data of the two shoes in our paper.) Column 460 is the activity label (please refer to the OPPORTUNITY dataset for the meaning of those activities), column 461 is the activity drill (also check the dataset information), and column 462 denotes the person (1~4).
(3) pamap.mat: 7312 * 245. Columns 1~243 are features, listed in the order of 'Wrist', 'Chest', and 'Ankle'. Column 244 is the activity label, and column 245 denotes the person (1~9).
There are another 3 datasets with the prefix 'cross_', containing only the 4 classes common to all datasets: lying, standing, walking, and sitting. These are for experiments on cross-dataset activity recognition (see our PerCom-18 paper).
(1) cross_dsads.mat: 1920 * 406. Columns 1~405 are features; column 406 is labels.
(2) cross_opp.mat: 5022 * 460. Columns 1~459 are features; column 460 is labels.
(3) cross_pamap.mat: 3063 * 244. Columns 1~243 are features; column 244 is labels.
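As a sketch of how the 1-based column layout of dsads.mat maps to 0-based indexing (pure Python on a synthetic row; loading the real .mat file, e.g. with scipy.io.loadmat, is not shown here):

```python
def split_dsads_row(row):
    # dsads.mat rows have 408 columns (1-based in the description):
    # 1-405 features (81 per body position x 5 positions),
    # 406 activity sequence, 407 activity label, 408 person.
    assert len(row) == 408
    features = row[0:405]
    activity_sequence = row[405]
    activity_label = row[406]
    person = row[407]
    return features, activity_sequence, activity_label, person

# Synthetic row: 405 zero features followed by the three label columns
row = [0.0] * 405 + [12, 5, 3]
feats, seq, label, person = split_dsads_row(row)
print(len(feats), seq, label, person)  # 405 12 5 3
```

The opp_* and pamap files follow the same pattern with their own column counts, as described above.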
-------- Original references for the 3 datasets:
[1] R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. d. R. Millán, and D. Roggen, “The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition,” Pattern Recognition Letters, vol. 34, no. 15, pp. 2033–2042, 2013.
[2] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in Wearable Computers (ISWC), 2012 16th International Symposium on. IEEE, 2012, pp. 108–109.
[3] B. Barshan and M. C. Yüksek, “Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units,” The Computer Journal, vol. 57, no. 11, pp. 1649–1667, 2014.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset, metadata, and R Markdown file required to reproduce all analyses in the manuscript "Priority effects alter the colonization success of a host-associated parasite and mutualist" (Burr et al. 2021 Ecology)
Version 5 release notes:
- Removes support for SPSS and Excel data.
- Changes the crimes that are stored in each file. There are more files now, with fewer crimes per file. The files and their included crimes have been updated below.
- Adds in agencies that report 0 months of the year.
- Adds a column that indicates the number of months reported, generated by summing the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; they may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
- Removes data on runaways.
Version 4 release notes:
- Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
- Adds data for 2016.
- Orders rows by year (descending) and ORI.
Version 2 release notes:
- Fixes a bug where Philadelphia Police Department had an incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. The R code used to clean this data is available at https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data has no value other than "None/not reported" when an agency reports zero arrests; in other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units, such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.
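The recoding described above can be sketched as follows (a minimal stand-in; the actual cleaning was done in R in the linked repository, and this per-value helper is hypothetical):

```python
# Sentinel arrest counts the description says are recoded to missing (NA)
BAD_VALUES = {10000, 20000, 30000, 40000, 50000, 60000, 70000,
              80000, 90000, 100000, 99999, 99998}

def clean_arrest_value(value):
    # "None/not reported" is assumed to mean zero arrests (possibly incorrect);
    # known-bad sentinel counts become None (NA).
    if value == "None/not reported":
        return 0
    if value in BAD_VALUES:
        return None
    return value

print([clean_arrest_value(v) for v in ["None/not reported", 3, 99999]])
# [0, 3, None]
```

Note that this rule cannot distinguish true zeros from missing reports, which is exactly the caveat stated above.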
To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.
I created 9 arrest categories myself. The categories are:
- Total Male Juvenile
- Total Female Juvenile
- Total Male Adult
- Total Female Adult
- Total Male
- Total Female
- Total Juvenile
- Total Adult
- Total Arrests
All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file includes only the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.
As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight contain different crimes, and one is the "simple" file. Each file contains the data for all years. The eight categories each cover a major crime category and do not overlap in crimes other than the index offenses. Please note that the crime names provided below are not the same as the column names in the data; because Stata limits column names to 32 characters, I have abbreviated the crime names in the data. The files and their included crimes are:
Index Crimes
- Murder
- Rape
- Robbery
- Aggravated Assault
- Burglary
- Theft
- Motor Vehicle Theft
- Arson
Alcohol Crimes
- DUI
- Drunkenness
- Liquor
Drug Crimes
- Total Drug
- Total Drug Sales
- Total Drug Possession
- Cannabis Possession
- Cannabis Sales
- Heroin or Cocaine Possession
- Heroin or Cocaine Sales
- Other Drug Possession
- Other Drug Sales
- Synthetic Narcotic Possession
- Synthetic Narcotic Sales
Grey Collar and Property Crimes
- Forgery
- Fraud
- Stolen Property
Financial Crimes
- Embezzlement
- Total Gambling
- Other Gambling
- Bookmaking
- Numbers Lottery
Sex or Family Crimes
- Offenses Against the Family and Children
- Other Sex Offenses
- Prostitution
- Rape
Violent Crimes
- Aggravated Assault
- Murder
- Negligent Manslaughter
- Robbery
- Weapon Offenses
Other Crimes
- Curfew
- Disorderly Conduct
- Other Non-traffic
- Suspicion
- Vandalism
- Vagrancy
Simple
- This data set has every crime and only the arrest categories that I created (see above).
This survey is part of a long-term Bonneville Power Administration-funded effort by academic and federal scientists to understand coastal ecosystems and biological and physical processes that may influence recruitment variability of salmon in Pacific Northwest waters. Prior to a potential switch to the R/V Shimada as a long-term platform, we intend to compare catches between vessels (the R/V Shimada and the F/V Frosti) on the continental shelf of Washington. Sampling will occur during the day for 3 days (23-25 May) with surface trawls along pre-specified transects. One trawl will be performed as we leave the Strait of Juan de Fuca on the afternoon of the 22nd. In addition, a bongo net will be towed several times each night between 2300 and 0500 (nights of 19-22 June).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background and Purpose:
There are very few publicly available datasets on real-world falls in the scientific literature, due to the rarity of natural falls and the inherent difficulties in gathering biomechanical and physiological data from young subjects or older adults residing in their communities in a non-intrusive and user-friendly manner. This data gap has hindered research on fall prevention strategies. Immersive Virtual Reality (VR) environments provide a unique solution.
This dataset supports research in fall prevention by providing an immersive VR setup that simulates diverse ecological environments and randomized visual disturbances, aimed at triggering and analyzing balance-compensatory reactions. The dataset is a unique tool for studying human balance responses to VR-induced perturbations, facilitating research that could inform training programs, wearable assistive technologies, and VR-based rehabilitation methods.
Dataset Content:
The dataset includes:
Kinematic Data: Captured using a full-body Xsens MVN Awinda inertial measurement system, providing detailed movement data at 60 Hz.
Muscle Activity (EMG): Recorded at 1111 Hz using Delsys Trigno for tracking muscle contractions.
Electrodermal Activity (EDA): Captured at 100.21 Hz with a Shimmer GSR device on the dominant forearm to record physiological responses to perturbations.
Metadata: Includes participant demographics (age, height, weight, gender, dominant hand and foot), trial conditions, and perturbation characteristics (timing and type).
The files are named in the format "ParticipantX_labelled", where X represents the participant's number. Each file is provided in a .mat format, with data already synchronized across different sensor sources. The structure of each file is organized into the following columns:
Column 1: Label indicating the visual perturbation applied. 0 means no visual perturbation.
Column 2: Timestamp, providing the precise timing of each recorded data point.
Column 3: Frame identifier, which can be cross-referenced with the MVN file for detailed motion analysis.
Columns 4 to 985: Xsens motion capture features, exported directly from the MVN file.
Columns 986 to 993: EMG data - Tibialis Anterior (R&L), Gastrocnemius Medial Head (R&L), Rectus Femoris (R), Semitendinosus (R), External Oblique (R), Sternocleidomastoid (R).
Columns 994 to 1008: Shimmer data: Accelerometer (x,y,z), Gyroscope (x,y,z), Magnetometer (x,y,z), GSR Range, Skin Conductance, Skin Resistance, PPG, Pressure, Temperature.
In addition, we are also releasing the .MVN and .MVNA files for each participant (1 to 10), which provide comprehensive motion capture data and include the participants' body measurements, respectively. This additional data enables precise body modeling and further in-depth biomechanical analysis.
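The column layout above translates to the following 0-based slices (pure Python on a synthetic row; loading the actual .mat files, e.g. with scipy.io.loadmat, is an assumption not shown here):

```python
def split_row(row):
    # Columns (1-based in the description): 1 label, 2 timestamp, 3 frame id,
    # 4-985 Xsens features, 986-993 EMG channels, 994-1008 Shimmer channels.
    assert len(row) == 1008
    return {
        "label": row[0],
        "timestamp": row[1],
        "frame": row[2],
        "xsens": row[3:985],       # 982 motion-capture features
        "emg": row[985:993],       # 8 EMG channels
        "shimmer": row[993:1008],  # 15 Shimmer channels
    }

# Synthetic row for illustration: label 0 (no perturbation), then zeros
row = [0] * 1008
parts = split_row(row)
print(len(parts["xsens"]), len(parts["emg"]), len(parts["shimmer"]))  # 982 8 15
```

The 8 EMG slices correspond, in order, to the muscles listed above, and the 15 Shimmer slices to the accelerometer, gyroscope, magnetometer, GSR, PPG, pressure, and temperature channels.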
Participants & VR Headset:
Twelve healthy young adults (average age: 25.09 ± 2.81 years; height: 167.82 ± 8.40 cm; weight: 64.83 ± 7.77 kg; 6 males, 6 females) participated in this study (Table 1). Participants met the following criteria: i) healthy locomotion, ii) stable postural balance, iii) age ≥ 18 years, and iv) body weight < 135 kg.
Participants were excluded if they: i) had any condition affecting locomotion, ii) had epilepsy, vestibular disorders, or other neurological conditions impacting stability, iii) had undergone recent surgeries impacting mobility, iv) were involved in other experimental studies, v) were under judicial protection or guardianship, or vi) experienced complications using VR headsets (e.g., motion sickness).
All participants provided written informed consent, adhering to the ethical guidelines set by the University of Minho Ethics Committee (CEICVS 063/2021), in compliance with the Declaration of Helsinki and the Oviedo Convention.
To ensure unbiased reactions, participants were kept unaware of the specific protocol details. Visual disturbances were introduced in a random sequence and at various locations, enhancing the unpredictability of the experiment and simulating a naturalistic response.
The VR setup involved an HTC Vive Pro headset with two wirelessly synchronized base stations that tracked participants’ head movements within a 5m x 2.5m area. The base stations adjusted the VR environment’s perspective according to head movements, while controllers were used solely for setup purposes.
Table 1 - Participants' demographic information
| Participant | Height (cm) | Weight (kg) | Age | Gender | Dom. Hand | Dom. Foot |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 159 | 56.5 | 23 | F | Right | Right |
| 2 | 157 | 55.3 | 28 | F | Right | Right |
| 3 | 174 | 67.1 | 31 | M | Right | Right |
| 4 | 176 | 73.8 | 23 | M | Right | Right |
| 5 | 158 | 57.3 | 23 | F | Right | Right |
| 6 | 181 | 70.9 | 27 | M | Right | Right |
| 7 | 171 | 73.3 | 23 | M | Right | Right |
| 8 | 159 | 69.2 | 28 | F | Right | Right |
| 9 | 177 | 57.3 | 22 | M | Right | Right |
| 10 | 171 | 75.5 | 25 | M | Right | Right |
| 11 | 163 | 58.1 | 23 | F | Right | Right |
| 12 | 168 | 63.7 | 25 | F | Right | Right |
Data Collection Methodology:
The experimental protocol was designed to integrate four essential components: (i) precise control over stimuli, (ii) high reproducibility of the experimental conditions, (iii) preservation of ecological validity, and (iv) promotion of real-world learning transfer.
Participant Instructions and Familiarization Trial: Before starting, participants were given specific instructions to (i) seek assistance if they experienced motion sickness, (ii) adjust the VR headset for comfort by modifying the lens distance and headset fit, (iii) stay within the defined virtual play area demarcated by a blue boundary, and (iv) complete a familiarization trial. During this trial, participants were encouraged to explore various virtual environments while performing a sequence of three key movements—walking forward, turning around, and returning to the initial location—without any visual perturbations. This familiarization phase helped participants acclimate to the virtual space in a controlled setting.
Experimental Protocol and Visual Perturbations: Participants were exposed to 11 different types of visual perturbations as outlined in Table 2, applied across a total of 35 unique perturbation variants (Table 3). Each variant involved the same type of perturbation, such as a clockwise Roll Axis Tilt, but varied in intensity (e.g., rotation speed) and was presented in randomized virtual locations. The selection of perturbation types was grounded in existing literature on visual disturbances. This design ensured that participants experienced a diverse range of visual effects in a manner that maintained ecological validity, supporting the potential for generalization to real-world scenarios where visual perturbations might occur spontaneously.
Protocol Flow and Randomized Presentation: Throughout the experimental protocol, each visual perturbation variant was presented three times, and participants engaged repeatedly in the familiarization activities over a nearly one-hour period. These activities—walking forward, turning around, and returning to the starting point—took place in a 5m x 2.5m physical space mirrored in VR, allowing participants to take 7–10 steps before turning. Participants were not informed of the timing or nature of any perturbations, which could occur unpredictably during their forward walk, adding a realistic element of surprise. After each return to the starting point, participants were relocated to a random position within the virtual environment, with the sequence of positions determined by a randomized, computer-generated order.
Table 2 - Visual perturbations' name and parameters (L - Lateral; B - Backward; F - Forward; S - Slip; T - Trip; CW- Clockwise; CCW - Counter-Clockwise)
Perturbation [Fall Category] | Parameters
Roll Axis Tilt - CW [L] | [10º, 20º, 30º] during 0.5 s
Roll Axis Tilt - CCW [L] | [10º, 20º, 30º] during 0.5 s
Support Surface ML Axis Translation - Bidirectional [L] | Discrete movement (static pauses between movements), 1 m/s
AP Axis Translation - Front [F] | 1 m/s
AP Axis Translation - Backwards [B] | 1 m/s
Pitch Axis Tilt [S] | 0º-25º, 60º/s
Virtual object with lower height than a real object [T] | Variable object height
Roll-Pitch-Yaw Axis Tilt [Syncope] | Sum of sinusoids drives each axis rotation
Scene Object Movement [L] | Objects fly towards the subject's head; variable speeds
Vertigo Sensation [F/L] | Walk at a comfortable speed, with and without avatar, at house height
Axial Axis Translation [F/B/L] | Free fall
Table 3 - Label Encoding
Visual Perturbation = Label (three mappings per row)
Roll Indoor 1 CW10 = 1 | Roll Indoor 1 CW20 = 2 | Roll Indoor 1 CW30 = 3
Roll Indoor 1 CCW10 = 4 | Roll Indoor 1 CCW20 = 5 | Roll Indoor 1 CCW30 = 6
Roll Indoor 2 CW10 = 7 | Roll Indoor 2 CW20 = 8 | Roll Indoor 2 CW30 = 9
Roll Indoor 2 CCW10 = 10 | Roll Indoor 2 CCW20 = 11 | Roll Indoor 2 CCW30 = 12
Roll Outdoor CW10 = 13 | Roll Outdoor CW20 = 14 | Roll Outdoor CW30 = 15
Roll Outdoor CCW10 = 16 | Roll Outdoor CCW20 = 17 | Roll Outdoor CCW30 = 18
ML-Axis Trans. - Kitchen = 19 | AP-Axis Trans. - Corridor Forward = 20 | AP-Axis Trans. - Corridor Backward = 21
Pitch Indoor - Bathroom (wet floor) = 22 | Pitch Indoor - Near Fridge (wet floor) = 23 | Roof Beam Walking - Vertigo = 24
Roof Beam Walking - Vertigo No Avatar = 25 | Simple Roof - Vertigo = 26 | Simple Roof - Vertigo No Avatar = 27
Pitch Outdoor - Near Car Oil = 28 | Trip - Sidewalk / Trip Shock = 29/290 | Bedroom Syncope = 30
Garden - Object Avoidance = 31 | Electricity Pole - Vertigo = 32 | Electricity Pole - No Avatar = 33
Free Fall = 34 | Climbing Virtual Stairs = 35
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This data was obtained from the Maricopa County Assessor under the search "Fast Food". The query matched approximately 1,342 results, with only 1,000 returned due to MCA data policies.
Because some Subdivision Name values contained unescaped commas that interfered with Pandas' ability to properly align the columns, I performed some manual cleaning in LibreOffice.
Aside from a handful of null values, the data is fairly clean and requires little work in Pandas.
Here are the sums and percentage of NULLS in the dataframe.
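A minimal pandas sketch of how such a summary can be produced; the dataframe and column names below are illustrative, not the actual records:

```python
import pandas as pd

# Illustrative dataframe; the real data has 1,000 records and more columns.
df = pd.DataFrame({
    "Address": ["100 Main St", None, "200 Elm St"],
    "Subdivision Name": [None, None, "SUNSET PLAZA"],
})
summary = pd.DataFrame({
    "nulls": df.isnull().sum(),
    "percent": (df.isnull().mean() * 100).round(1),
})
print(summary)
```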
Interestingly, 17 records have no physical address at all. This amounts to 1.7% of the values for Address, City, and Zip, and the missing values all occur in the same rows.
I have looked into a couple of these on the Maricopa County Assessor's GIS Portal, and they do not appear to have any assigned physical addresses. This is a good avenue of exploration for EDA. Possibly an error that could be corrected, or some obscure legal reason, but interesting nonetheless.
Additionally, there are 391 NULLS in Subdivision Name, accounting for 39.1%. This is a feature I am interested in exploring to determine whether there are any predominant groups. It could also generate a list of entities that can be searched later to see if the dataset can be enriched beyond its initial 1,000-record limit.
There are 348 NULLS in the MCR column. This is the definition according to the MCA Glossary:
MCR (MARICOPA COUNTY RECORDER NUMBER)
Often associated with recorded plat maps.
This seems to be an uninteresting nominal value, so I will drop this column.
While Property Type and Rental have no NULLS, 100% of their values are "Fast Food Restaurant" and "N" (for No), respectively; they offer no useful information and will be dropped.
I will keep the S/T/R column. Although it also appears to hold uninteresting nominal values, I am curious whether there are predominant groups, and since it has no NULLS, it might be useful for further data enrichment.
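The planned column drops can be sketched in pandas as follows; the values are toy data, and only the column names follow the listing above:

```python
import pandas as pd

# Toy records; only the column names follow the listing above.
df = pd.DataFrame({
    "MCR": [None, "12345"],
    "Property Type": ["Fast Food Restaurant", "Fast Food Restaurant"],
    "Rental": ["N", "N"],
    "S/T/R": ["1-2N-3E", "4-2N-3E"],
})
df = df.drop(columns=["MCR", "Property Type", "Rental"])
```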
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
!!!WARNING~~~ This dataset has a large number of flaws and is unable to properly answer many questions that people generally use it to answer, such as whether national hate crimes are changing (or at least they use the data so improperly that they get the wrong answer). A large number of people using this data (academics, advocates, reporters, the US Congress) do so inappropriately and get the wrong answer to their questions as a result. Indeed, many published papers using this data should be retracted. Before using this data I highly recommend that you thoroughly read my book on UCR data, particularly the chapter on hate crimes (https://ucrbook.com/hate-crimes.html), as well as the FBI's own manual on this data. The questions you could potentially answer well are relatively narrow and generally exclude any causal relationships. ~~~WARNING!!!
For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.com
Version 8 release notes:
- Adds 2019 and 2020 data. Please note that the FBI has retired UCR data ending in 2020, so this will be the last UCR hate crime data they release.
- Changes the .rda file to .rds.
Version 7 release notes:
- Changes the release notes description; does not change the data.
Version 6 release notes:
- Adds 2018 data.
Version 5 release notes:
- Adds data in the following formats: SPSS, SAS, and Excel.
- Changes the project name to avoid confusing this data with the versions done by NACJD.
- Adds data for 1991.
- Fixes a bug where the bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013, causing there to be two columns and zero values for years with the wrong label.
- All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS setup files and is read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R.
Version 4 release notes:
- Adds data for 2017.
- Adds rows that submitted a zero-report (i.e. the agency reported no hate crimes in that year). This is for all years 1992-2017.
- Makes changes to categorical variables (e.g. bias motivation columns) so that categories are consistent over time. Different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian'), which I made consistent.
- Adds the 'population' column, which is the total population in that agency.
Version 3 release notes:
- Adds data for 2016.
- Orders rows by year (descending) and ORI.
Version 2 release notes:
- Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.
The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open. Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9-character Originating Identifier code), and incident number columns. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating whether the victim of each offense was a certain type of victim or not (e.g. individual victim, business victim, religious victim, etc.). The only changes I made to the data are the following: minor changes to column names to make all column names 32 characters or fewer (so the data can be saved in Stata format), making all character values lower case, and reordering the columns.
I also generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
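The unique ID construction described above can be sketched in pandas; the column names and example values here are hypothetical, not the dataset's exact fields:

```python
import pandas as pd

# Hypothetical columns and values; the real data combines the year, the
# agency ORI9 code, and the incident number.
df = pd.DataFrame({
    "year": [2017, 2017],
    "ori9": ["PA0012345", "NY0303000"],
    "incident_number": ["000123", "000456"],
})
df["unique_id"] = (
    df["year"].astype(str) + "_" + df["ori9"] + "_" + df["incident_number"]
)
```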
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Walking and running are mechanically and energetically different locomotion modes. For selecting one or another, speed is a parameter of paramount importance. Yet, both are likely controlled by similar low-dimensional neuronal networks that reflect in patterned muscle activations called muscle synergies. Here, we investigated how humans synergistically activate muscles during locomotion at different submaximal and maximal speeds. We analysed the duration and complexity (or irregularity) over time of motor primitives, the temporal components of muscle synergies. We found that the challenge imposed by controlling high-speed locomotion forces the central nervous system to produce muscle activation patterns that are wider and less complex relative to the duration of the gait cycle. The motor modules, or time-independent coefficients, were redistributed as locomotion speed changed. These outcomes show that robust locomotion control at challenging speeds is achieved by modulating the relative contribution of muscle activations and producing less complex and wider control signals, whereas slow speeds allow for more irregular control.
In this supplementary data set we made available: a) the metadata with anonymized participant information, b) the raw EMG, c) the touchdown and lift-off timings of the recorded limb, d) the filtered and time-normalized EMG, e) the muscle synergies extracted via NMF and f) the code to process the data, including the scripts to calculate the Higuchi's fractal dimension (HFD) of motor primitives. In total, 180 trials from 30 participants are included in the supplementary data set.
The file “metadata.dat” is available in ASCII and RData format and contains:
Code: the participant’s code
Group: the experimental group in which the participant was involved (G1 = walking and submaximal running; G2 = submaximal and maximal running)
Sex: the participant’s sex (M or F)
Speeds: the type of locomotion (W for walking or R for running) and speed at which the recordings were conducted in 10*[m/s]
Age: the participant’s age in years
Height: the participant’s height in [cm]
Mass: the participant’s body mass in [kg]
PB: 100 m-personal best time (for G2).
The "RAW_DATA.RData" R list consists of elements of S3 class "EMG", each of which is a human locomotion trial containing cycle segmentation timings and raw electromyographic (EMG) data from 13 muscles of the right-side leg. Cycle times are structured as data frames with two columns corresponding to touchdown (first column) and lift-off (second column). Raw EMG data sets are also structured as data frames with one row for each recorded data point and 14 columns. The first column contains the incremental time in seconds. The remaining 13 columns contain the raw EMG data, named with the following muscle abbreviations: ME = gluteus medius, MA = gluteus maximus, FL = tensor fasciæ latæ, RF = rectus femoris, VM = vastus medialis, VL = vastus lateralis, ST = semitendinosus, BF = biceps femoris, TA = tibialis anterior, PL = peroneus longus, GM = gastrocnemius medialis, GL = gastrocnemius lateralis, SO = soleus. Please note that the following trials include fewer than 30 gait cycles (the actual number is shown in parentheses): P16_R_83 (20), P16_R_95 (25), P17_R_28 (28), P17_R_83 (24), P17_R_95 (13), P18_R_95 (23), P19_R_95 (18), P20_R_28 (25), P20_R_42 (27), P20_R_95 (25), P22_R_28 (23), P23_R_28 (29), P24_R_28 (28), P24_R_42 (29), P25_R_28 (29), P25_R_95 (28), P26_R_28 (29), P26_R_95 (28), P27_R_28 (28), P27_R_42 (29), P27_R_95 (24), P28_R_28 (29), P29_R_95 (17). All the other trials consist of 30 gait cycles. Trials are named like "P20_R_20", where the characters "P20" indicate the participant number (in this example the 20th), the character "R" indicates the locomotion type (W = walking, R = running), and the numbers "20" indicate the locomotion speed in 10*m/s (in this case the speed is 2.0 m/s). The filtered and time-normalized EMG data is named, following the same rules, like "FILT_EMG_P03_R_30".
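Trial names following this convention can be decoded programmatically. A small Python sketch: the naming rule is taken from the description above, while the function itself is illustrative:

```python
import re

def parse_trial(name):
    """Decode a trial name like 'P20_R_20' (speed is stored as 10*m/s)."""
    m = re.fullmatch(r"P(\d+)_([WR])_(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized trial name: {name}")
    participant, loco, speed = m.groups()
    return {
        "participant": int(participant),
        "locomotion": "walking" if loco == "W" else "running",
        "speed_m_s": int(speed) / 10,
    }
```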
Old versions (not compatible with the R package musclesyneRgies):
The files containing the gait cycle breakdown are available in RData format, in the file named “CYCLE_TIMES.RData”. The files are structured as data frames with as many rows as the available number of gait cycles and two columns. The first column, named “touchdown”, contains the touchdown incremental times in seconds. The second column, named “stance”, contains the duration of each stance phase of the right foot in seconds. Each trial is saved as an element of a single R list. Trials are named like “CYCLE_TIMES_P20_R_20”, where the characters “CYCLE_TIMES” indicate that the trial contains the gait cycle breakdown times, the characters “P20” indicate the participant number (in this example the 20th), the character “R” indicates the locomotion type (W = walking, R = running), and the numbers “20” indicate the locomotion speed in 10*m/s (in this case the speed is 2.0 m/s). Please note that the following trials include fewer than 30 gait cycles (the actual number is shown in parentheses): P16_R_83 (20), P16_R_95 (25), P17_R_28 (28), P17_R_83 (24), P17_R_95 (13), P18_R_95 (23), P19_R_95 (18), P20_R_28 (25), P20_R_42 (27), P20_R_95 (25), P22_R_28 (23), P23_R_28 (29), P24_R_28 (28), P24_R_42 (29), P25_R_28 (29), P25_R_95 (28), P26_R_28 (29), P26_R_95 (28), P27_R_28 (28), P27_R_42 (29), P27_R_95 (24), P28_R_28 (29), P29_R_95 (17).
The files containing the raw and the filtered, time-normalized EMG data are available in RData format, in the files named “RAW_EMG.RData” and “FILT_EMG.RData”. The raw EMG files are structured as data frames with as many rows as the number of recorded data points and 13 columns. The first column, named “time”, contains the incremental time in seconds. The remaining 12 columns contain the raw EMG data, named with muscle abbreviations that follow those reported above. Each trial is saved as an element of a single R list. Trials are named like “RAW_EMG_P03_R_30”, where the characters “RAW_EMG” indicate that the trial contains raw EMG data, the characters “P03” indicate the participant number (in this example the 3rd), the character “R” indicates the locomotion type (see above), and the numbers “30” indicate the locomotion speed (see above). The filtered and time-normalized EMG data is named, following the same rules, like “FILT_EMG_P03_R_30”.
The files containing the muscle synergies extracted from the filtered and normalized EMG data are available in RData format, in the files named “SYNS_H.RData” and “SYNS_W.RData”. The muscle synergies files are divided into motor primitives and motor modules and are presented as the direct output of the factorisation, not in any functional order. Motor primitives are data frames with 6000 rows and a number of columns equal to the number of synergies (which might differ from trial to trial) plus one. The rows contain the time-dependent coefficients (motor primitives), one column for each synergy plus the time points (columns are named e.g. “time, Syn1, Syn2, Syn3”, where “Syn” is the abbreviation for “synergy”). Each gait cycle contains 200 data points, 100 for the stance and 100 for the swing phase, which, multiplied by the 30 recorded cycles, result in 6000 data points distributed in as many rows. This output is transposed as compared to the one discussed in the methods section to improve user readability. Each set of motor primitives is saved as an element of a single R list. Trials are named like “SYNS_H_P12_W_07”, where the characters “SYNS_H” indicate that the trial contains motor primitive data, the characters “P12” indicate the participant number (in this example the 12th), the character “W” indicates the locomotion type (see above), and the numbers “07” indicate the speed (see above). Motor modules are data frames with 12 rows (the number of recorded muscles) and a number of columns equal to the number of synergies (which might differ from trial to trial). The rows, named with muscle abbreviations that follow those reported above, contain the time-independent coefficients (motor modules), one for each synergy and for each muscle. Each set of motor modules relative to one synergy is saved as an element of a single R list.
Trials are named like “SYNS_W_P22_R_20”, where the characters “SYNS_W” indicate that the trial contains motor module data, the characters “P22” indicate the participant number (in this example the 22nd), the character “W” indicates the locomotion type (see above), and the numbers “20” indicate the speed (see above). Given the nature of the NMF algorithm for the extraction of muscle synergies, the supplementary data set might show non-significant differences as compared to the one used for obtaining the results of this paper.
The files containing the HFD calculated from motor primitives are available in RData format, in the file named “HFD.RData”. HFD results are presented in a list of lists containing, for each trial, 1) the HFD, and 2) the interval time k used for the calculations. HFDs are presented as one number (mean HFD of the primitives for that trial), as are the interval times k. Trials are named like “HFD_P01_R_95”, where the characters “HFD” indicate that the trial contains HFD data, the characters “P01” indicate the participant number (in this example the 1st), the character “R” indicates the locomotion type (see above), and the numbers “95” indicate the speed (see above).
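Higuchi's fractal dimension, used here to quantify the complexity of motor primitives, can be sketched as follows. This is a generic Python implementation of the standard Higuchi algorithm, not the authors' R code; `kmax` is a free parameter of the method:

```python
import numpy as np

def higuchi_fd(x, kmax=10):
    """Estimate Higuchi's fractal dimension of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ks, lengths = [], []
    for k in range(1, kmax + 1):
        lm = []
        for m in range(k):  # one reduced series per offset m
            n_steps = (n - m - 1) // k
            if n_steps < 1:
                continue
            idx = m + np.arange(n_steps + 1) * k
            curve_len = np.abs(np.diff(x[idx])).sum()
            # normalization factor for unequal reduced-series lengths
            lm.append(curve_len * (n - 1) / (n_steps * k) / k)
        ks.append(k)
        lengths.append(np.mean(lm))
    # HFD is the slope of log(L(k)) against log(1/k)
    slope, _ = np.polyfit(np.log(1.0 / np.array(ks)), np.log(lengths), 1)
    return slope
```

A straight line has dimension 1, while more irregular signals approach 2, which matches the paper's use of HFD as a measure of primitive complexity.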
All the code used for the pre-processing of EMG data, the extraction of muscle synergies and the calculation of HFD is available in R format. Explanatory comments are provided throughout the script “muscle_synergies.R”.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
# https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
#################################
### Variables for downloaded files
data.dir <- ' '
train.file <- paste0(data.dir, 'training.csv')
test.file <- paste0(data.dir, 'test.csv')
#################################
### Load csv -- creates a data.frame where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = F)
d.test <- read.csv(test.file, stringsAsFactors = F)
### In training.csv, we have 7049 rows, each with 31 columns.
### The first 30 columns are keypoint locations, which R correctly identified as numbers.
### The last one is a string representation of the image.
### To look at samples of the data, uncomment this line:
### Let's save the Image column as another variable, and remove it from d.train:
### d.train is our dataframe, and we want the column called Image.
### Assigning NULL to a column removes it from the dataframe.
im.train <- d.train$Image
d.train$Image <- NULL  # removes 'Image' from the dataframe
im.test <- d.test$Image
d.test$Image <- NULL  # removes 'Image' from the dataframe
#################################
# The image is represented as a series of numbers, stored as a string.
# Convert these strings to integers by splitting them and converting the result:
# strsplit splits the string,
# unlist simplifies its output to a vector of strings,
# as.integer converts it to a vector of integers.
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))
### Install and activate the appropriate libraries.
### The tutorial is meant for Linux and OS X, where a different parallel backend is used, so:
### replace all instances of %dopar% with %do%.
library("foreach", lib.loc="~/R/win-library/3.3")
### Convert every image string to an integer vector
im.train <- foreach(im = im.train, .combine=rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
im.test <- foreach(im = im.test, .combine=rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
# The foreach loop evaluates the inner command for each row in im.train and combines
# the results with rbind (combine by rows).
# Note: %do% evaluates sequentially; %dopar% (with a registered parallel backend)
# would run the iterations in parallel.
# im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel).
### Save all four variables in a data.Rd file;
### you can reload them at any time with load('data.Rd').
save(d.train, d.test, im.train, im.test, file='data.Rd')
# Each image is a vector of 96*96 pixels (96*96 = 9216).
# Convert these 9216 integers into a 96x96 matrix:
im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)
# im.train[1,] returns the first row of im.train, which corresponds to the first training image.
# rev reverses the resulting vector to match the interpretation of R's image function
# (which expects the origin to be in the lower left corner).
# To visualize the image we use R's image function:
image(1:96, 1:96, im, col=gray((0:255)/255))
# Let's color the coordinates for the eyes and nose:
points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red")
points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue")
points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")
# Another good check is to see how variable our data is.
# For example, where are the centers of each nose in the 7049 images? (this takes a while to run)
for(i in 1:nrow(d.train)) {
  points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red")
}
# There are quite a few outliers -- they could be labeling errors.
# Looking at one extreme example we get this:
idx <- which.max(d.train$nose_tip_x)
im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96)
image(1:96, 1:96, im, col=gray((0:255)/255))
points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")
# In this case there's no labeling error, but it shows that not all faces are centered.
# One of the simplest things to try is to compute the mean of the coordinates of each
# keypoint in the training set and use that as a prediction for all images:
colMeans(d.train, na.rm=T)
# To build a submission file we need to apply these computed coordinates to the test instances:
p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)
# The expected submission format has one keypoint per row, but we can easily get
# that with the help of the reshape2 library:
library(reshape2)
This data originates from the Crossref API. It has metadata on the articles contained in the Data Citation Corpus where the citation pair dataset is a DOI.
How to recreate this dataset in Jupyter Notebook:
1) Prepare the list of articles to query

```python
import pandas as pd

CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"

citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

# Drop rows whose dataset identifier is an https URL that is not a DOI,
# then drop figshare entries.
citation_pairs = citation_pairs[
    ~((citation_pairs['dataset'].str.contains("https"))
      & (~citation_pairs['dataset'].str.contains("doi.org")))
]
citation_pairs = citation_pairs[~citation_pairs['dataset'].str.contains("figshare")]

# Keep only citation pairs where the dataset is a DOI.
citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
citation_pairs_doi = citation_pairs[citation_pairs['is_doi']].copy()

articles = list(set(citation_pairs_doi['publication'].to_list()))
articles = [doi.replace("_", "/") for doi in articles]

# Write one DOI per line; the enrichment script reads the file line by line.
with open("articles.txt", "w") as f:
    for article in articles:
        f.write(f"{article}\n")
```
2) Query articles from CrossRef API
```python
%%writefile enrich.py
# !pip install -q aiolimiter
import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
from aiolimiter import AsyncLimiter

# ---------- config ----------
HEADERS = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"}  # Put your email here
MAX_RPS = 45          # polite pool limit is 50; leave head-room
BATCH_SIZE = 10_000   # rows per INSERT
DB_PATH = pathlib.Path("crossref.sqlite").resolve()
ARTICLES = pathlib.Path("articles.txt")
# -----------------------------

# ---- platform tweak: prefer selector loop on Windows ----
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

# ---- read the DOI list ----
with ARTICLES.open(encoding="utf-8") as f:
    DOIS = [line.strip() for line in f if line.strip()]

# ---- make sure DB & table exist BEFORE the async part ----
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
with sqlite3.connect(DB_PATH) as db:
    db.execute("""
        CREATE TABLE IF NOT EXISTS works (
            doi  TEXT PRIMARY KEY,
            json TEXT
        )
    """)
    db.execute("PRAGMA journal_mode=WAL;")  # better concurrency

# ---------- async section ----------
limiter = AsyncLimiter(MAX_RPS, 1)   # 45 req / second
sem = asyncio.Semaphore(100)         # cap overall concurrency

async def fetch_one(session, doi: str):
    url = f"https://api.crossref.org/works/{doi}"
    async with limiter, sem:
        try:
            async with session.get(url, headers=HEADERS, timeout=10) as r:
                if r.status == 404:      # common "not found"
                    return doi, None
                r.raise_for_status()     # propagate other 4xx/5xx
                return doi, await r.json()
        except Exception:
            return doi, None             # log later, don't crash

async def main():
    start = time.perf_counter()
    db = sqlite3.connect(DB_PATH)                 # KEEP ONE connection
    db.execute("PRAGMA synchronous = NORMAL;")    # speed tweak
    async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
        for chunk_start in range(0, len(DOIS), BATCH_SIZE):
            slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
            tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
            results = await asyncio.gather(*tasks)  # all tuples, no exceptions
            good_rows, bad_dois = [], []
            for doi, payload in results:
                if payload is None:
                    bad_dois.append(doi)
                else:
                    good_rows.append((doi, orjson.dumps(payload).decode()))
            if good_rows:
                db.executemany(
                    "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
                    good_rows,
                )
                db.commit()
            if bad_dois:  # append for later retry
                with open("failures.log", "a", encoding="utf-8") as fh:
                    fh.writelines(f"{d}\n" for d in bad_dois)
            done = chunk_start + len(slice_)
            rate = done / (time.perf_counter() - start)
            print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")
    db.close()

if __name__ == "__main__":
    asyncio.run(main())
```
Then run:

```python
!python enrich.py
```
3) Finally extract the necessary fields
```python
import sqlite3
import orjson

i...
```
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original dataset
The original year-2019 dataset was downloaded from the World Bank Databank by the following approach on July 23, 2022.
Database: "World Development Indicators"
Country: 266 (all available)
Series: "CO2 emissions (kt)", "GDP (current US$)", "GNI, Atlas method (current US$)", and "Population, total"
Time: 1960, 1970, 1980, 1990, 2000, 2010, 2017, 2018, 2019, 2020, 2021
Layout: Custom -> Time: Column, Country: Row, Series: Column
Download options: Excel
Preprocessing
With LibreOffice:
- remove non-country entries (unnecessary rows after the line for Zimbabwe),
- shorten column names for easy processing: Country Name -> Country, Country Code -> Code, "XXXX ... GNI ..." -> GNI_1990, etc. (note the '_', not '-', for R).
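The column-name shortening can equally be done programmatically. A pandas sketch; the original World Bank header shown here is illustrative of the Databank export format:

```python
import pandas as pd

# Illustrative headers resembling a World Bank Databank Excel export.
df = pd.DataFrame(columns=[
    "Country Name",
    "Country Code",
    "1990 [YR1990] GNI, Atlas method (current US$)",
])
df = df.rename(columns={
    "Country Name": "Country",
    "Country Code": "Code",
    "1990 [YR1990] GNI, Atlas method (current US$)": "GNI_1990",  # '_' is R-friendly
})
```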