https://doi.org/10.5061/dryad.brv15dvh0
On each trial, participants heard a stimulus and clicked a box on the computer screen to indicate whether they heard "SET" or "SAT." Responses of "SET" are coded as 0 and responses of "SAT" are coded as 1. The continuum steps, from 1-7, for duration and spectral quality cues of the stimulus on each trial are named "DurationStep" and "SpectralStep," respectively. Group (young or older adult) and listening condition (quiet or noise) information are provided for each row of the dataset.
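To illustrate the layout described above, here is a minimal R sketch that computes the proportion of "SAT" responses per spectral-quality step, split by group and listening condition. The file name and the Response and Condition column names are assumptions; DurationStep, SpectralStep, and Group follow the description above.

```r
# Minimal sketch, not the authors' analysis code.
# "set_sat_trials.csv", Response, and Condition are hypothetical names;
# DurationStep, SpectralStep, and Group follow the dataset description.
trials <- read.csv("set_sat_trials.csv")

# Proportion of "SAT" responses (coded 1) per spectral step, group, and condition
aggregate(Response ~ SpectralStep + Group + Condition, data = trials, FUN = mean)
```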
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Significant shifts in latitudinal optima of North American birds (PNAS)
Paulo Mateus Martins, Marti J. Anderson, Winston L. Sweatman, and Andrew J. Punnett

Overview
This file contains the raw 2022 release of the North American Breeding Bird Survey (BBS) dataset (Ziolkowski Jr et al. 2022), the filtered version used in our paper, and the code that generated it. We also include code that uses BirdLife's species distribution shapefiles to classify species as eastern or western based on their occurrence in the BBS dataset and to calculate the percentage of their range covered by the BBS sampling extent. Note that this code requires species distribution shapefiles, which are not provided but can be obtained directly from https://datazone.birdlife.org/species/requestdis.

Reference
D. J. Ziolkowski Jr., M. Lutmerding, V. I. Aponte, M. A. R. Hudson, North American breeding bird survey dataset 1966–2021: U.S. Geological Survey data release (2022), https://doi.org/10.5066/P97WAZE5

Detailed file description
info_birds_names_shp: A data frame that links BBS species names (column Species) to shapefiles (column Species_BL). See the code2_sampling coverage script.
dat_raw_BBS_data_v2022: This R environment contains the raw BBS data from the 2022 release (https://www.sciencebase.gov/catalog/item/625f151ed34e85fa62b7f926). It holds data frames created from the files "Routes.zip" (route information), "SpeciesList.txt" (bird taxonomy), and "50-StopData.zip" (counts per route and year). This object is the starting point for creating the dataset used in the paper, which was filtered to remove taxonomic uncertainties, as demonstrated in the "code1_build_long_wide_datasets" R script.
code1_build_long_wide_datasets: This code filters the original dataset (dat_raw_BBS_data_v2022) to remove taxonomic uncertainties, assigns routes as either eastern or western based on regionalization using the dynamically constrained agglomerative clustering and partitioning method (see the Methods section of the paper), and generates the full long and wide versions of the dataset used in the analyses (dat2_filtered_data_long, dat3_filtered_data_wide).
dat2_filtered_data_long: The filtered raw dataset in long form. For the analyses this dataset was further filtered to remove nocturnal and aquatic species, as well as species with fewer than 30 occurrences, but the complete version is available here. To obtain the exact subset used in the analysis, filter this dataset using the column Species from datasets S1 or S3.
dat3_filtered_data_wide: The filtered raw dataset in wide form. For the analyses this dataset was further filtered to remove nocturnal and aquatic species, as well as species with fewer than 30 occurrences, but the complete version is available here. To obtain the exact subset used in the analysis, filter this dataset using the column Species from datasets S1 or S3.
code2_sampling coverage: This code determines how much of a bird's distribution is covered by the BBS sampling extent (refer to Dataset S1). Note that this script requires bird species distribution shapefiles from BirdLife International, which we are not permitted to share. The shapefiles can be requested directly at https://datazone.birdlife.org/species/requestdis
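For orientation, a minimal R sketch of the subsetting step described above, filtering dat2_filtered_data_long down to the species analyzed in the paper. It assumes the long-form object is already loaded into the workspace and that the supplementary species list has been exported to a file; "dataset_S1.csv" is a hypothetical name for that file.

```r
# Minimal sketch; dat2_filtered_data_long is assumed to be loaded already and
# "dataset_S1.csv" (hypothetical name) to hold the Species column from Dataset S1.
s1 <- read.csv("dataset_S1.csv")

analysis_subset <- subset(dat2_filtered_data_long, Species %in% s1$Species)
```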
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing limitations of this dataset:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the "site_id_tile_hv_crosswalk.csv" file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

File descriptions:
- "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files with the byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables that aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" – This crosswalk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" – This crosswalk file identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" – A map of the Landsat grid tiles labelled by the horizontal-vertical ID.
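The filtering recommendations above translate directly into a query on the nested .parquet files. Below is a minimal R sketch, separate from the provided example_script_for_using_parquet.R, that assumes one "year_byscene=2023.zip" archive has been unzipped locally and uses the arrow and dplyr packages; the 10% cloud threshold is only an illustration.

```r
# Minimal sketch; assumes "year_byscene=2023.zip" was unzipped to ./year_byscene=2023
# and that column names follow the limitations/methods description above.
library(arrow)
library(dplyr)

byscene <- open_dataset("year_byscene=2023")

filtered <- byscene %>%
  mutate(percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)) %>%
  filter(percent_cloud_pixels < 0.10,   # illustrative cloud-cover threshold
         wb_dswe1_pixels >= 10,         # drop waterbodies with few water pixels
         dp_dswe == 1) %>%              # deepest point classified as water
  collect()
```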
https://creativecommons.org/publicdomain/zero/1.0/
This is a pre-release of Seshat-NLP, a dataset of labelled text segments derived from the Seshat Databank. These text segments were originally used in the Seshat Databank to justify the coding of historical "facts". A data point in the Seshat Databank describes a property of a past society at a certain time or time range. We use these data points with their textual justifications to extract an NLP dataset of text segments accompanied by topic labels.
The dataset is organised around unique text segments (i.e., one unique segment per row). These segments are connected with labels that designate the historical information contained within the text. Each segment has at least one 4-tuple of labels associated with it but can have more. The labels are ("variable_name", "variable_id", "value", and "polity_id").
Below is a simplified example row in our dataset (illustrative data only):

| Description | Labels ("variable", "var_id", "value", "polity") | Reference |
| --- | --- | --- |
| Thebes was the capital … | [("Capital", "…", "Thebes", "Egypt Middle Kingdom"), …] | {"Title": "The Oxford Encyclopedia of …", "Author": "…", "DOI": "…", …} |
Our dataset partially consists of segments taken from the scientific literature on history; we pair these segments with labels that denote their content. We are currently looking into the legal considerations of releasing such data. In the meantime, we have added information to our dataset that allows identification of the pertaining documents for each description.
This file is a PostgreSQL dump that can be used to instantiate the PostgreSQL table with all the data.
The table zenodoexport has the following columns:
| Column Name | Column Description |
| --- | --- |
| id | row identifier |
| description | textual justification of coded value |
| labels | labels for description |
| reference_information | information required to retrieve documents |
| description_hash | utility column |
| zodero_id | utility column |
The hierarchy_graph.gexf file is an XML-based export of the hierarchy graph that can be used to tie variables to their hierarchical position in the Seshat codebook.
The labels column contains a list of 4-tuples which in order denote "variable_name", "variable_id", "value", and "polity_id".
We use this structure to allow a single segment/description to have multiple 4-tuples of labels; this is useful when the same description has been used to justify multiple "facts" in the original Seshat Databank.
The variable_ids can be used to tie variable labels to nodes in the hierarchy of the Seshat codebook.
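A minimal R sketch of querying the restored table; the database name and connection details are hypothetical, while the table and column names follow the description above.

```r
# Minimal sketch; "seshat" and the connection defaults are hypothetical.
library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "seshat")

# Pull a few segments with their label 4-tuples and reference information
segments <- dbGetQuery(
  con,
  "SELECT id, description, labels, reference_information FROM zenodoexport LIMIT 10"
)

dbDisconnect(con)
```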
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and results from the Imageomics Workflow. These include data files from the Fish-AIR repository (https://fishair.org/) for purposes of reproducibility and outputs from the application-specific imageomics workflow contained in the Minnow_Segmented_Traits repository (https://github.com/hdr-bgnn/Minnow_Segmented_Traits).
Fish-AIR: This is the dataset downloaded from Fish-AIR, filtered for Cyprinidae and the Great Lakes Invasive Network (GLIN) from the Illinois Natural History Survey (INHS) dataset. These files contain information about the fish images, fish image quality, and the path for downloading the images. The data download ARK ID is dtspz368c00q (2023-04-05). The following files, unaltered from the Fish-AIR download, are used:
extendedImageMetadata.csv: A CSV file containing information about each image file. It has the following columns: ARKID, fileNameAsDelivered, format, createDate, metadataDate, size, width, height, license, publisher, ownerInstitutionCode. Column definitions are provided at https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.
imageQualityMetadata.csv: A CSV file containing information about the quality of each image. It has the following columns: ARKID, license, publisher, ownerInstitutionCode, createDate, metadataDate, specimenQuantity, containsScaleBar, containsLabel, accessionNumberValidity, containsBarcode, containsColorBar, nonSpecimenObjects, partsOverlapping, specimenAngle, specimenView, specimenCurved, partsMissing, allPartsVisible, partsFolded, brightness, uniformBackground, onFocus, colorIssue, quality, resourceCreationTechnique. Column definitions are provided at https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.
multimedia.csv: A CSV file containing information about image downloads. It has the following columns: ARKID, parentARKID, accessURI, createDate, modifyDate, fileNameAsDelivered, format, scientificName, genus, family, batchARKID, batchName, license, source, ownerInstitutionCode. Column definitions are provided at https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.
meta.xml: An XML file with metadata about the column indices and URIs for each file contained in the original downloaded zip file. This file is used in the fish-air.R script to extract the indices for column headers.
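A minimal R sketch of joining these files on their shared ARKID key, for example to attach a few quality flags (taken from the column list above) to the image-download table; it assumes the CSV files sit in the working directory.

```r
# Minimal sketch; assumes the Fish-AIR CSV files are in the working directory.
multimedia <- read.csv("multimedia.csv")
quality    <- read.csv("imageQualityMetadata.csv")

# Attach selected quality flags to each image record via the shared ARKID key
img_quality <- merge(
  multimedia,
  quality[, c("ARKID", "specimenView", "onFocus", "quality")],
  by = "ARKID"
)
```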
The outputs from the Minnow_Segmented_Traits workflow are:
sampling.df.seg.csv: Table with tallies of the sampling of image data per species during the data cleaning and data analysis. This is used in Table S1 in Balk et al.
presence.absence.matrix.csv: The Presence-Absence matrix from segmentation, not cleaned. This is the result of the combined outputs from the presence.json files created by the rule “create_morphological_analysis”. The cleaned version of this matrix is shown as Table S3 in Balk et al.
heatmap.avg.blob.png and heatmap.sd.blob.png: Heatmaps of average area of biggest blob per trait (heatmap.avg.blob.png) and standard deviation of area of biggest blob per trait (heatmap.sd.blob.png). These images are also in Figure S3 of Balk et al.
minnow.filtered.from.iqm.csv: Fish image data set after filtering (see methods in Balk et al. for the filter categories).
burress.minnow.sp.filtered.from.iqm.csv: Fish image data set after filtering and selecting species from Burress et al. 2017.
Version 5 release notes:
- Removes support for SPSS and Excel data.
- Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
- Adds in agencies that report 0 months of the year.
- Adds a column that indicates the number of months reported. This is generated by summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; they may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
- Removes data on runaways.
Version 4 release notes:
- Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
- Adds data for 2016.
- Orders rows by year (descending) and ORI.
Version 2 release notes:
- Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. The R code used to clean this data is available here: https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.
To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS codes (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros; if you open the file in Excel, it will automatically delete those leading zeros.
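A minimal R sketch of one way to keep those leading zeros when reading the data into R: force the FIPS columns to character. The file name and the exact FIPS column names used here are hypothetical; substitute the names in the file you download.

```r
# Minimal sketch; the file name and FIPS column names are hypothetical.
arrests <- read.csv(
  "ucr_arrests_index_crimes.csv",
  colClasses = c(fips_state_code  = "character",
                 fips_county_code = "character",
                 fips_place_code  = "character")
)
```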
I created 9 arrest categories myself. The categories are:
- Total Male Juvenile
- Total Female Juvenile
- Total Male Adult
- Total Female Adult
- Total Male
- Total Female
- Total Juvenile
- Total Adult
- Total Arrests
All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.
As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight contain different crimes and one is the "simple" file. Each file contains the data for all years. The eight categories each contain crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to a maximum of 32 characters, I have abbreviated the crime names in the data. The files and their included crimes are:
Index Crimes: Murder, Rape, Robbery, Aggravated Assault, Burglary, Theft, Motor Vehicle Theft, Arson
Alcohol Crimes: DUI, Drunkenness, Liquor
Drug Crimes: Total Drug, Total Drug Sales, Total Drug Possession, Cannabis Possession, Cannabis Sales, Heroin or Cocaine Possession, Heroin or Cocaine Sales, Other Drug Possession, Other Drug Sales, Synthetic Narcotic Possession, Synthetic Narcotic Sales
Grey Collar and Property Crimes: Forgery, Fraud, Stolen Property
Financial Crimes: Embezzlement, Total Gambling, Other Gambling, Bookmaking, Numbers Lottery
Sex or Family Crimes: Offenses Against the Family and Children, Other Sex Offenses, Prostitution, Rape
Violent Crimes: Aggravated Assault, Murder, Negligent Manslaughter, Robbery, Weapon Offenses
Other Crimes: Curfew, Disorderly Conduct, Other Non-traffic, Suspicion, Vandalism, Vagrancy
Simple: This data set has every crime and only the arrest categories that I created (see above).
If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
This dataset contains all the data and code needed to reproduce the analyses in the manuscript: Penn, H. J., & Read, Q. D. (2023). Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage. Pest Management Science. https://doi.org/10.1002/ps.7843

Included are two .Rmd notebooks containing all code required to reproduce the analyses in the manuscript, two .html files of rendered notebook output, three .csv data files that are loaded and analyzed, and a .zip file of intermediate R objects that are generated during the model fitting and variable selection process.

Notebook files
01_boring_analysis.Rmd: This RMarkdown notebook contains R code to read and process the raw data, create exploratory data visualizations and tables, fit a Bayesian generalized linear mixed model, extract output from the statistical model, and create graphs and tables summarizing the model output, including marginal means for different varieties and contrasts between crop years.
02_trait_covariate_analysis.Rmd: This RMarkdown notebook contains R code to read raw variety-level trait data, perform feature selection based on correlations between traits, fit another generalized linear mixed model using traits as predictors, and create graphs and tables from that model output, including marginal means by categorical trait and marginal trends by continuous trait.

HTML files
These HTML files contain the rendered output of the two RMarkdown notebooks. They were generated by Quentin Read on 2023-08-30 and 2023-08-15.
01_boring_analysis.html
02_trait_covariate_analysis.html

CSV data files
These files contain the raw data. To recreate the notebook output, the CSV files should be at the file path project/data/ relative to where the notebook is run. Columns are described below.

BoredInternodes_26April2022_no format.csv: primary data file with sugarcane borer (SCB) damage
- Columns A-C are the year, date, and location. All location values are the same.
- Column D identifies which experiment the data point was collected from.
- Column E, Stubble, indicates the crop year (plant cane or first stubble).
- Column F indicates the variety.
- Column G indicates the plot (integer ID).
- Column H indicates the stalk within each plot (integer ID).
- Column I, # Internodes, indicates how many internodes were on the stalk.
- Columns J-AM are numbered 1-30 and indicate whether SCB damage was observed on that internode (0 if no, 1 if yes, blank cell if that internode was not present on the stalk).
- Column AN indicates the experimental treatment for those rows that are part of a manipulative experiment.
- Column AO contains notes.

variety_lookup.csv: summary information for the 16 varieties analyzed in this study
- Column A is the variety name.
- Column B is the total number of stalks assessed for SCB damage for that variety across all years.
- Column C is the number of years that variety is present in the data.
- Column D, Stubble, indicates which crop years were sampled for that variety ("PC" if only plant cane, "PC, 1S" if there are data for both plant cane and first stubble crop years).
- Column E, SCB resistance, is a categorical designation with four values: susceptible, moderately susceptible, moderately resistant, resistant.
- Column F is the literature reference for the SCB resistance value.

Select_variety_traits_12Dec2022.csv: variety-level traits for the 16 varieties analyzed in this study
- Column A is the variety name.
- Column B is the SCB resistance designation as an integer.
- Column C is the categorical SCB resistance designation (see above).
- Columns D-I are continuous traits from year 1 (plant cane), including sugar (Mg/ha), biomass or aboveground cane production (Mg/ha), TRS or theoretically recoverable sugar (g/kg), stalk weight of individual stalks (kg), stalk population density (stalks/ha), and fiber content of stalk (percent).
- Columns J-O are the same continuous traits from year 2 (first stubble).
- Columns P-V are categorical traits (in some cases continuous traits binned into categories): maturity timing, amount of stalk wax, amount of leaf sheath wax, amount of leaf sheath hair, tightness of leaf sheath, whether leaf sheath becomes necrotic with age, and amount of collar hair.

ZIP file of intermediate R objects
To recreate the notebook output without having to run computationally intensive steps, unzip the archive. The fitted model objects should be at the file path project/ relative to where the notebook is run.
intermediate_R_objects.zip: This file contains intermediate R objects that are generated during the model fitting and variable selection process. You may use the R objects in the .zip file if you would like to reproduce the final output, including figures and tables, without having to refit the computationally intensive statistical models.
- binom_fit_intxns_updated_only5yrs.rds: fitted brms model object for the main statistical model
- binom_fit_reduced.rds: fitted brms model object for the trait covariate analysis
- marginal_trends.RData: calculated values of the estimated marginal trends with respect to year and previous damage
- marginal_trend_trs.rds: calculated values of the estimated marginal trend with respect to TRS
- marginal_trend_fib.rds: calculated values of the estimated marginal trend with respect to fiber content

Resources in this dataset:
- Resource Title: Sugarcane borer damage data by internode, 1993-2021. File Name: BoredInternodes_26April2022_no format.csv
- Resource Title: Summary information for the 16 sugarcane varieties analyzed. File Name: variety_lookup.csv
- Resource Title: Variety-level traits for the 16 sugarcane varieties analyzed. File Name: Select_variety_traits_12Dec2022.csv
- Resource Title: RMarkdown notebook 2: trait covariate analysis. File Name: 02_trait_covariate_analysis.Rmd
- Resource Title: Rendered HTML output of notebook 2. File Name: 02_trait_covariate_analysis.html
- Resource Title: RMarkdown notebook 1: main analysis. File Name: 01_boring_analysis.Rmd
- Resource Title: Rendered HTML output of notebook 1. File Name: 01_boring_analysis.html
- Resource Title: Intermediate R objects. File Name: intermediate_R_objects.zip
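As an example of reusing the intermediate objects described above, a minimal R sketch that loads the main fitted model without refitting it; the relative path assumes the archive was unzipped so the .rds files sit under project/, and summarising the fit assumes the brms package is installed.

```r
# Minimal sketch; the path assumes intermediate_R_objects.zip was unzipped into project/.
library(brms)

main_fit <- readRDS("project/binom_fit_intxns_updated_only5yrs.rds")
summary(main_fit)
```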
This data originates from the Crossref API. It contains metadata on the articles in the Data Citation Corpus for which the dataset in the citation pair is a DOI.
How to recreate this dataset in Jupyter Notebook:
1) Prepare the list of articles to query

```python
import pandas as pd

CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"

citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

# Drop dataset identifiers that are URLs but not DOIs
citation_pairs = citation_pairs[
    ~((citation_pairs['dataset'].str.contains("https")) & (~citation_pairs['dataset'].str.contains("doi.org")))
]

# Drop figshare entries
citation_pairs = citation_pairs[~citation_pairs['dataset'].str.contains("figshare")]

# Keep only citation pairs whose dataset identifier is a DOI
citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
citation_pairs_doi = citation_pairs[citation_pairs['is_doi'] == True].copy()

# De-duplicate the citing publications and restore "/" in the DOIs
articles = list(set(citation_pairs_doi['publication'].to_list()))
articles = [doi.replace("_", "/") for doi in articles]

# Write one DOI per line
with open("articles.txt", "w") as f:
    for article in articles:
        f.write(f"{article}\n")
```
2) Query articles from CrossRef API
```python
%%writefile enrich.py
#!pip install -q aiolimiter
import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
from aiolimiter import AsyncLimiter

# ---------- config ----------
HEADERS = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"}  # Put your email here
MAX_RPS = 45          # polite pool limit (50), leave head-room
BATCH_SIZE = 10_000   # rows per INSERT
DB_PATH = pathlib.Path("crossref.sqlite").resolve()
ARTICLES = pathlib.Path("articles.txt")
# -----------------------------

# ---- platform tweak: prefer selector loop on Windows ----
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

# ---- read the DOI list ----
with ARTICLES.open(encoding="utf-8") as f:
    DOIS = [line.strip() for line in f if line.strip()]

# ---- make sure DB & table exist BEFORE the async part ----
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
with sqlite3.connect(DB_PATH) as db:
    db.execute("""
        CREATE TABLE IF NOT EXISTS works (
            doi TEXT PRIMARY KEY,
            json TEXT
        )
    """)
    db.execute("PRAGMA journal_mode=WAL;")  # better concurrency

# ---------- async section ----------
limiter = AsyncLimiter(MAX_RPS, 1)   # 45 req / second
sem = asyncio.Semaphore(100)         # cap overall concurrency

async def fetch_one(session, doi: str):
    url = f"https://api.crossref.org/works/{doi}"
    async with limiter, sem:
        try:
            async with session.get(url, headers=HEADERS, timeout=10) as r:
                if r.status == 404:      # common "not found"
                    return doi, None
                r.raise_for_status()     # propagate other 4xx/5xx
                return doi, await r.json()
        except Exception:
            return doi, None             # log later, don't crash

async def main():
    start = time.perf_counter()
    db = sqlite3.connect(DB_PATH)               # KEEP ONE connection
    db.execute("PRAGMA synchronous = NORMAL;")  # speed tweak
    async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
        for chunk_start in range(0, len(DOIS), BATCH_SIZE):
            slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
            tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
            results = await asyncio.gather(*tasks)  # all tuples, no exceptions

            good_rows, bad_dois = [], []
            for doi, payload in results:
                if payload is None:
                    bad_dois.append(doi)
                else:
                    good_rows.append((doi, orjson.dumps(payload).decode()))

            if good_rows:
                db.executemany(
                    "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
                    good_rows,
                )
                db.commit()

            if bad_dois:  # append for later retry
                with open("failures.log", "a", encoding="utf-8") as fh:
                    fh.writelines(f"{d}\n" for d in bad_dois)

            done = chunk_start + len(slice_)
            rate = done / (time.perf_counter() - start)
            print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")
    db.close()

if __name__ == "__main__":
    asyncio.run(main())
```
Then run:

```python
!python enrich.py
```
3) Finally, extract the necessary fields

```python
import sqlite3
import orjson
i...
```
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in an R or Python environment. Please consult the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
```python
from datasets import load_dataset
import polars as pl

# Load the full RFSD via the Hugging Face Datasets library
RFSD = load_dataset('irlspbru/RFSD')

# Or read a single yearly partition directly from the Hub with polars
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
```
Please note that the data is not shuffled within a year, meaning that streaming the first n rows will not yield a random sample.
Local File Import
Importing in Python requires the pyarrow package to be installed.

```python
import pyarrow.dataset as ds
import polars as pl

# Open the dataset from a local directory of .parquet files
RFSD = ds.dataset("local/path/to/RFSD")

# Inspect the schema
print(RFSD.schema)

# Load the entire dataset into polars
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load only the 2019 partition
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only firm identifiers (inn) and revenue (line_2110) for 2019
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Apply descriptive column names from the provided dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
```
R
Local File Import
Importing in R requires the arrow package to be installed.

```r
library(arrow)
library(data.table)

# Open the dataset from a local directory of .parquet files
RFSD <- open_dataset("local/path/to/RFSD")

# Inspect the schema
schema(RFSD)

# Load the entire dataset
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load only the 2019 partition
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only firm identifiers (inn) and revenue (line_2110) for 2019
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Apply descriptive column names from the provided dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
```
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, and Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and therefore are able to source this information elsewhere.
A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded to the house level in 2014 and 2021-2023, but only to the street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at PSITAdministration@ChicagoPolice.org.

Disclaimer: These crimes may be based upon preliminary information supplied to the Police Department by the reporting parties that have not been verified. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. The Chicago Police Department will not be responsible for any error or omission, or for the use of, or the results obtained from the use of this information. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited. The Chicago Police Department is not responsible for the content of any off-site pages that are referenced by or that reference this web page other than an official City of Chicago or Chicago Police Department web page. The user specifically acknowledges that the Chicago Police Department is not responsible for any defamatory, offensive, misleading, or illegal conduct of other users, links, or third parties and that the risk of injury from the foregoing rests entirely with the user. The unauthorized use of the words "Chicago Police Department," "Chicago Police," or any colorable imitation of these words or the unauthorized use of the Chicago Police Department logo is unlawful. This web page does not, in any way, authorize such use.

Data are updated daily. To access a list of Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) codes, go to http://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DATA TABLES FOR RUNNING THE SCRIPTS"Dataset.csv" : this file contains the ASV table refined from the dataset generated by Tournayre et al. 2025. The refinement process involved retaining 1) only the 2022 data, and 2) only birds (excluding exotic caged birds such as budgerigars, cockatiels, and grey parrots), mammals (including domestic ones), insects, non-algal plants, and fungi (except for two phyla represented by three ASVs that were rarely detected). The columns "Supergroup", "group", "phylum", "class", "Order", "Family", "Genus", and "species" provide the detailed taxonomy of each ASV. The column "FINAL_ID" indicates the final ASV identification at the most precise taxonomic level, using the following prefixes: "c:" for class, "p:" for phylum, "o:" for order, "f:" for family, "g:" for genus, and "s:" for species. Sample names are coded as follows: the first four letters represent the site, followed by the sampling year and month. For example, "AUCH_2022_apr" corresponds to a sample collected at Auchencorth Moss in April 2022."Dataset_no-domestic-mammals.csv" : Same dataset as "Dataset.csv" described above, minus the domestic mammals: rabbit, sheep, horse, donkey, cat, dog, goat and pig."CorineMetal.csv": this dataset contains CORINE Land Cover 2018 surface data (Cole et al., 2021, add reference) and heavy metal pollutant concentrations (© Crown 2025) used to perform the land cover – pollution PCA. Land cover categories were aggregated into five main types based on the CORINE land cover classification: Artificial surfaces (codes: 111, 112, 121–124, 131–133, 141, 142), Agricultural areas (codes: 211, 222, 231, 242, 243), Forest and semi-natural areas (codes: 311–313, 321–324, 331, 333), Wetlands (codes: 411, 412, 421, 423), Water bodies (codes: 511, 512, 522, 523). Pollutant data for 2022 were obtained from the DEFRA UK AIR Information Resource (https://uk-air.defra.gov.uk/data/) (© Crown 2025). These include the average concentrations (ng/m³) of metals primarily emitted from fossil fuel combustion and industrial processes: arsenic (As), cadmium (Cd), cobalt (Co), copper (Cu), iron (Fe), manganese (Mn), nickel (Ni), lead (Pb), selenium (Se), vanadium (V), and zinc (Zn). The dataset provides site-averaged values for each metal. The "Environment_type" column specifies the environment type of UK Heavy Metal air quality monitoring sites, as defined on the UK AIR website (https://uk-air.defra.gov.uk/networks/site-types)."hfp2013_merisINT.tif" : a raster file used to extract average values from the 2013 Human Footprint maps (Williams et al., 2020)."BetaComp_output_nodomestic.csv" : a dataset containing beta diversity component outputs (total beta diversity "BDTotal", turnover, and nestedness) calculated for each site and urban vs. rural environments using the beta.div.comp function (coeff = "BJ", Baselga Jaccard-based) from the adespatial R package (Dray et al., 2024). This dataset is used to generate Figure S3. Note that it excludes domestic mammal species."Site_coordinates_info.csv" : a table providing site details, including: Full site name ("Site"), Site code ("SITE_ID"), UK-AIR Station ID ("UK-AIR ID"), Geographic coordinates ("Latitude", "Longitude"), Label position adjustments for visualization ("nudge_x", "nudge_y"), Environment type (Urban vs. Rural) based on the UK AIR Information Resource.R SCRIPTS TO RUN THE ANALYSES"Analyses.R" : this script performs the alpha and beta diversity analyses. 
Note that plots for different taxonomic groups were generated separately and later manually combined into the final figures."Make_The_Map.R" : this script generates the map used in the top panel of Figure 1.
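A minimal R sketch of working with the ASV table described above, using the documented FINAL_ID prefixes to keep only ASVs resolved to species level; it assumes "Dataset.csv" is in the working directory.

```r
# Minimal sketch; uses the FINAL_ID prefix convention described above ("s:" = species).
asv <- read.csv("Dataset.csv")

species_level <- subset(asv, startsWith(FINAL_ID, "s:"))
nrow(species_level)   # number of ASVs identified to species level
```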
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
The issue of diagnosing psychotic diseases, including schizophrenia and bipolar disorder, in particular, the objectification of symptom severity assessment, is still a problem requiring the attention of researchers. Two measures that can be helpful in patient diagnosis are heart rate variability calculated based on electrocardiographic signal and accelerometer mobility data. The following dataset contains data from 30 psychiatric ward patients having schizophrenia or bipolar disorder and 30 healthy persons. The duration of the measurements for individuals was usually between 1.5 and 2 hours. R-R intervals necessary for heart rate variability calculation were collected simultaneously with accelerometer data using a wearable Polar H10 device. The Positive and Negative Syndrome Scale (PANSS) test was performed for each patient participating in the experiment, and its results were attached to the dataset. Furthermore, the code for loading and preprocessing data, as well as for statistical analysis, was included on the corresponding GitHub repository.
BACKGROUND
Heart rate variability (HRV), calculated based on electrocardiographic (ECG) recordings of R-R intervals stemming from the heart's electrical activity, may be used as a biomarker of mental illnesses, including schizophrenia and bipolar disorder (BD) [Benjamin et al]. The variations of R-R interval values correspond to the heart's autonomic regulation changes [Berntson et al, Stogios et al]. Moreover, the HRV measure reflects the activity of the sympathetic and parasympathetic parts of the autonomous nervous system (ANS) [Task Force of the European Society of Cardiology the North American Society of Pacing Electrophysiology, Matusik et al]. Patients with psychotic mental disorders show a tendency for a change in the centrally regulated ANS balance in the direction of less dynamic changes in the ANS activity in response to different environmental conditions [Stogios et al]. Larger sympathetic activity relative to the parasympathetic one leads to lower HRV, while, on the other hand, higher parasympathetic activity translates to higher HRV. This loss of dynamic response may be an indicator of mental health. Additional benefits may come from measuring the daily activity of patients using accelerometry. This may be used to register periods of physical activity and inactivity or withdrawal for further correlation with HRV values recorded at the same time.
EXPERIMENTS
In our experiment, the participants were 30 psychiatric ward patients with schizophrenia or BD and 30 healthy people. All measurements were performed using a Polar H10 wearable device. The sensor collects ECG recordings and accelerometer data and, additionally, performs detection of R wave peaks. Participants of the experiment had to wear the sensor for a given time. Typically this was between 1.5 and 2 hours, but the shortest recording was 70 minutes. During this time, participants could perform any activity from a few minutes after the start of the measurement. Participants were encouraged to undertake physical activity and, more specifically, to take a walk. Because the patients were in the medical ward, they received instructions to take a walk in the corridors at the beginning of the experiment. They were to repeat the walk 30 minutes and 1 hour after the first walk. The subsequent walks were to be slightly longer (about 3, 5 and 7 minutes, respectively). We did not remind participants of these instructions or supervise their execution during the experiment, in either the treatment or the control group. Seven persons from the control group did not receive this instruction and their measurements correspond to freely selected activities with rest periods, but at least three of them performed physical activities during this time. Nevertheless, at the start of the experiment, all participants were requested to rest in a sitting position for 5 minutes. Moreover, for each patient, the disease severity was assessed using the PANSS test and its scores are attached to the dataset.
The data from sensors were collected using Polar Sensor Logger application [Happonen]. Such extracted measurements were then preprocessed and analyzed using the code prepared by the authors of the experiment. It is publicly available on the GitHub repository [Książek et al].
Firstly, we performed a manual artifact detection to remove abnormal heartbeats due to non-sinus beats and technical issues of the device (e.g. temporary disconnections and inappropriate electrode readings). We also performed anomaly detection using Daubechies wavelet transform. Nevertheless, the dataset includes raw data, while a full code necessary to reproduce our anomaly detection approach is available in the repository. Optionally, it is also possible to perform cubic spline data interpolation. After that step, rolling windows of a particular size and time intervals between them are created. Then, a statistical analysis is prepared, e.g. mean HRV calculation using the RMSSD (Root Mean Square of Successive Differences) approach, measuring a relationship between mean HRV and PANSS scores, mobility coefficient calculation based on accelerometer data and verification of dependencies between HRV and mobility scores.
DATA DESCRIPTION
The structure of the dataset is as follows. One folder, called HRV_anonymized_data, contains values of R-R intervals together with timestamps for each experiment participant. The data were properly anonymized, i.e. the day of the measurement was removed to prevent person identification. Files concerning patients are named treatment_X.csv, where X is the number of the person, while files related to the healthy controls are named control_Y.csv, where Y is the identification number of the person. Furthermore, for visualization purposes, an image of the raw RR intervals for each participant is provided. Its name is raw_RR_{control,treatment}_N.png, where N is the number of the person from the control/treatment group. The collected data are raw, i.e. before the anomaly removal. The code enabling reproduction of the anomaly detection stage and removal of suspicious heartbeats is publicly available in the repository [Książek et al]. The structure of the files collecting R-R intervals is as follows:
| Phone timestamp | RR-interval [ms] |
| --- | --- |
| 12:43:26.538000 | 651 |
| 12:43:27.189000 | 632 |
| 12:43:27.821000 | 618 |
| 12:43:28.439000 | 621 |
| 12:43:29.060000 | 661 |
| ... | ... |
The first column contains the timestamp for which the distance between two consecutive R peaks was registered. The corresponding R-R interval is presented in the second column of the file and is expressed in milliseconds.
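As an illustration of the RMSSD computation mentioned above, a minimal R sketch applied to one raw R-R file; it assumes the files are plain comma-separated with the two columns shown above, and it runs on the raw data, i.e. before the artifact removal implemented in the repository, so the value is only indicative.

```r
# Minimal sketch; assumes comma-separated files with the two columns shown above.
rr <- read.csv("HRV_anonymized_data/control_1.csv")

rr_ms <- rr[[2]]                     # second column: "RR-interval [ms]"
rmssd <- sqrt(mean(diff(rr_ms)^2))   # root mean square of successive differences
rmssd
```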
The second folder, called accelerometer_anonymized_data contains values of accelerometer data collected at the same time as R-R intervals. The naming convention is similar to that of the R-R interval data: treatment_X.csv and control_X.csv represent the data coming from the persons from the treatment and control group, respectively, while X is the identification number of the selected participant. The numbers are exactly the same as for R-R intervals. The structure of the files with accelerometer recordings is as follows:
| Phone timestamp | X [mg] | Y [mg] | Z [mg] |
| --- | --- | --- | --- |
| 13:00:17.196000 | -961 | -23 | 182 |
| 13:00:17.205000 | -965 | -21 | 181 |
| 13:00:17.215000 | -966 | -22 | 187 |
| 13:00:17.225000 | -967 | -26 | 193 |
| 13:00:17.235000 | -965 | -27 | 191 |
| ... | ... | ... | ... |
The first column contains a timestamp, while the next three columns correspond to the currently registered acceleration in three axes: X, Y and Z, in milli-g unit.
We also attached a file with the PANSS test scores (PANSS.csv) for all patients participating in the measurement. The structure of this file is as follows:
| no_of_person | PANSS_P | PANSS_N | PANSS_G | PANSS_total |
| --- | --- | --- | --- | --- |
| 1 | 8 | 13 | 22 | 43 |
| 2 | 11 | 7 | 18 | 36 |
| 3 | 14 | 30 | 44 | 88 |
| 4 | 18 | 13 | 27 | 58 |
| ... | ... | ... | ... | ... |
The first column contains the identification number of the patient, while the following columns refer to the PANSS scores related to positive, negative, and general symptoms, and the total PANSS score, respectively.
USAGE NOTES
All the files necessary to run the HRV and/or accelerometer data analysis are available on the GitHub repository [Książek et al]. HRV data loading, preprocessing (i.e. anomaly detection and removal), as well as the calculation of mean HRV values in terms of the RMSSD, is performed in the main.py file. Also, Pearson's correlation coefficients between HRV values and PANSS scores and the statistical tests (Levene's and Mann-Whitney U tests) comparing the treatment and control groups are computed. By default, a sensitivity analysis is made, i.e. running the full pipeline for different settings of the window size for which the HRV is calculated and various time intervals between consecutive windows. Preparing the heatmaps of correlation coefficients and corresponding p-values can be done by running the utils_advanced_plots.py file after performing the sensitivity analysis. Furthermore, a detailed analysis for the one selected set of hyperparameters may be prepared (by setting sensitivity_analysis = False), i.e. for 15-minute window sizes, 1-minute time intervals between consecutive windows and without data interpolation method. Also, patients taking quetiapine may be excluded from further calculations by setting exclude_quetiapine = True because this medicine can have a strong impact on HRV [Hattori et al].
The accelerometer data processing may be performed using the utils_accelerometer.py file. In this case, accelerometer recordings are downsampled to ensure the same timestamps as for R-R intervals and, for each participant, the mobility coefficient is calculated. Then, a correlation
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The human body is an outstandingly complex machine including around 1000 muscles and joints acting synergistically. Yet, the coordination of the enormous amount of degrees of freedom needed for movement is mastered by our one brain and spinal cord. The idea that some synergistic neural components of movement exist was already suggested at the beginning of the XX century. Since then, it has been widely accepted that the central nervous system might simplify the production of movement by avoiding the control of each muscle individually. Instead, it might be controlling muscles in common patterns that have been called muscle synergies. Only with the advent of modern computational methods and hardware it has been possible to numerically extract synergies from electromyography (EMG) signals. However, typical experimental setups do not include a big number of individuals, with common sample sizes of five to 20 participants. With this study, we make publicly available a set of EMG activities recorded during treadmill running from the right lower limb of 135 healthy and young adults (78 males, 57 females). Moreover, we include in this open access data set the code used to extract synergies from EMG data using non-negative matrix factorization and the relative outcomes. Muscle synergies, containing the time-invariant muscle weightings (motor modules) and the time-dependent activation coefficients (motor primitives), were extracted from 13 ipsilateral EMG activities using non-negative matrix factorization. Four synergies were enough to describe as many gait cycle phases during running: weight acceptance, propulsion, early swing and late swing. We foresee many possible applications of our data, that we can summarize in three key points. First, it can be a prime source for broadening the representation of human motor control due to the big sample size. Second, it could serve as a benchmark for scientists from multiple disciplines such as musculoskeletal modelling, robotics, clinical neuroscience, sport science, etc. Third, the data set could be used both to train students or to support established scientists in the perfection of current muscle synergies extraction methods.
The "RAW_DATA.RData" R list consists of elements of S3 class "EMG", each of which is a human locomotion trial containing cycle segmentation timings and raw electromyographic (EMG) data from 13 muscles of the right-side leg. Cycle times are structured as data frames containing two columns that correspond to touchdown (first column) and lift-off (second column). Raw EMG data sets are also structured as data frames with one row for each recorded data point and 14 columns. The first column contains the incremental time in seconds. The remaining 13 columns contain the raw EMG data, named with the following muscle abbreviations: ME = gluteus medius, MA = gluteus maximus, FL = tensor fasciæ latæ, RF = rectus femoris, VM = vastus medialis, VL = vastus lateralis, ST = semitendinosus, BF = biceps femoris, TA = tibialis anterior, PL = peroneus longus, GM = gastrocnemius medialis, GL = gastrocnemius lateralis, SO = soleus.
The file "dataset.rar" contains data in older format, not compatible with the R package musclesyneRgies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original dataset
The original year-2019 dataset was downloaded from the World Bank DataBank on July 23, 2022, using the following approach.
Database: "World Development Indicators" Country: 266 (all available) Series: "CO2 emissions (kt)", "GDP (current US$)", "GNI, Atlas method (current US$)", and "Population, total" Time: 1960, 1970, 1980, 1990, 2000, 2010, 2017, 2018, 2019, 2020, 2021 Layout: Custom -> Time: Column, Country: Row, Series: Column Download options: Excel
Preprocessing
With LibreOffice: remove non-country entries (rows after Zimbabwe); shorten column names for easier processing, e.g. Country Name -> Country, Country Code -> Code, "XXXX ... GNI ..." -> GNI_1990, etc. (note the '_', not '-', for compatibility with R); and remove unnecessary rows after the Zimbabwe line.
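These steps were performed manually in LibreOffice; purely as an illustration (the file name is hypothetical and the exact raw column headers depend on the World Bank export), an equivalent scripted version in pandas might look like this:

import pandas as pd

# Hypothetical file name; the raw column headers depend on the World Bank
# export and are only sketched here.
df = pd.read_excel("world_bank_2019.xlsx")

# Remove non-country entries: keep only rows up to and including Zimbabwe.
last_country = df.index[df["Country Name"] == "Zimbabwe"][0]
df = df.loc[:last_country]

# Shorten column names for easier processing, using '_' rather than '-' so
# that the names remain valid in R.
df = df.rename(columns={"Country Name": "Country", "Country Code": "Code"})
df.columns = [c.replace("-", "_").replace(" ", "_") for c in df.columns]

df.to_csv("world_bank_2019_clean.csv", index=False)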
**************** NTU Pedestrian Dataset *******************
The attached files contain our data collected inside the Nanyang Technological University campus for pedestrian intention prediction. The dataset is particularly designed to capture spontaneous vehicle influences on pedestrian crossing/not-crossing intention. We utilize this dataset in our journal paper "Context Model for Pedestrian Intention Prediction using Factored Latent-Dynamic Conditional Random Fields", accepted by IEEE Transactions on Intelligent Transportation Systems.
The dataset consists of 35 crossing and 35 stopping* (not-crossing) scenarios. The image sequences are in the 'Image_sequences' folder. The 'stopping_instants.csv' and 'crossing_instants.csv' files provide the stopping and crossing instants respectively, used for labeling the data and providing ground truth for evaluation. Camera1 and Camera2 images are synchronized; two cameras were used to capture the whole scene of interest.
We provide pedestrian and vehicle bounding boxes obtained from [1]. Occlusions and mis-detections are linearly interpolated. All necessary detections are stored in the 'Object_detector_pedestrians_vehicles' folder. Each column within the csv files ('car_bndbox_..') corresponds to a unique tracked car within each image sequence. Each of the pedestrian csv files ('ped_bndbox_..') contains only one column, as we consider each pedestrian in the scene separately.
Additional details:
* [xmin xmax ymin ymax] = left right top down
* Dataset frequency: 15 fps
* Camera parameters (in pixels): f = 1135, principal point = (960, 540)
Additionally, we provide the semantic segmentation output [2] and our depth parameters. As the data were collected in two phases, there are two files in each folder, highlighting the sequences in each phase. Crossing sequences 1-28 and stopping sequences 1-24 were collected in Phase 1, while crossing sequences 29-35 and stopping sequences 25-35 were collected in Phase 2. We obtained the optical flow from [3]. Our model (FLDCRF) code is available here: https://github.com/satyajitneogiju/FLDCRF-for-sequence-labeling
If you use our dataset in your research, please cite our paper(s):
1. S. Neogi, M. Hoy, K. Dang, H. Yu, J. Dauwels, "Context Model for Pedestrian Intention Prediction using Factored Latent-Dynamic Conditional Random Fields", accepted by IEEE Transactions on Intelligent Transportation Systems (T-ITS), 2019.
2. S. Neogi, M. Hoy, W. Chaoqun, J. Dauwels, "Context Based Pedestrian Intention Prediction Using Factored Latent Dynamic Conditional Random Fields", IEEE SSCI 2017.
Please email us if you have any questions:
1. Satyajit Neogi, PhD Student, Nanyang Technological University @ satyajit001@e.ntu.edu.sg
2. Justin Dauwels, Associate Professor, Nanyang Technological University @ jdauwels@ntu.edu.sg
Our other group members include:
3. Dr. Michael Hoy @ mch.hoy@gmail.com
4. Dr. Kang Dang @ kangdang@gmail.com
5. Ms. Lakshmi Prasanna Kachireddy
6. Mr. Mok Bo Chuan Lance
7. Mr. Xu Yan
References:
1. S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015.
2. A. Kendall, V. Badrinarayanan, R. Cipolla, "Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding", BMVC 2017.
3. C. Liu, "Beyond Pixels: Exploring New Representations and Applications for Motion Analysis", Doctoral Thesis, Massachusetts Institute of Technology, May 2009.
* Please note that we had to remove sequence Stopping-33 for privacy reasons.
FUNDING: STE-NTU NRF corporate lab@university scheme
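As a small usage illustration (not part of the dataset's tooling; the example bounding-box values are made up), the camera parameters listed above can be used to convert a bounding-box centre into approximate viewing angles under a simple pinhole model:

import math

F_PX = 1135.0          # focal length in pixels (from the camera parameters above)
CX, CY = 960.0, 540.0  # principal point in pixels

def bearing_from_bbox(xmin, xmax, ymin, ymax):
    # Approximate horizontal and vertical viewing angles (degrees) of the
    # bounding-box centre relative to the optical axis.
    u = 0.5 * (xmin + xmax)
    v = 0.5 * (ymin + ymax)
    yaw = math.degrees(math.atan2(u - CX, F_PX))    # left/right of the optical axis
    pitch = math.degrees(math.atan2(v - CY, F_PX))  # above/below the optical axis
    return yaw, pitch

# Made-up example box, in the [xmin xmax ymin ymax] order used by the csv files:
print(bearing_from_bbox(800, 900, 400, 700))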
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package accompanies the dataset and the exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories" published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)
Important notice: Zenodo appears to compress gzipped files a second time without notice, so they are "double compressed". When you download them, they should therefore be named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.
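For example (assuming the downloaded file has been renamed to the x.gz.gz pattern described above; the file names are illustrative), both compression layers can be removed with Python's standard library:

import gzip
import shutil

def gunzip(src, dst):
    # Strip one .gz layer.
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# The download is double compressed, so remove both layers in turn.
gunzip("workflows.csv.gz.gz", "workflows.csv.gz")
gunzip("workflows.csv.gz", "workflows.csv")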
2024-10-25 update: updated the repository list and observation period. The filters relying on dates were also updated.
2024-07-09 update: fixed an occasionally invalid valid_yaml flag.
The dataset was created as follows:
First, we used the GitHub SEART search tool (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits and at least 100 stars, and in which at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)
We checked whether a .github/workflows folder existed. We filtered out repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).
We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used was python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)
We concatenated all files in /ourDataFolder/output into a single csv (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.
We added the column uid via a script available on GitHub.
Finally, we archived the folder /ourDataFolder/workflows with pigz (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).
Using the extracted data, the following files were created:
workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.
workflows_auxiliaries.tar.gz is a similar file that also contains auxiliary files.
workflows.csv.gz contains the metadata for the extracted workflow files.
workflows_auxiliaries.csv.gz is a similar file that also contains metadata for auxiliary files.
repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART search tool.
The metadata is separated into the following columns:
repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" distinguishes the author from the repository name
commit_hash: The commit hash returned by git
author_name: The name of the author that changed this file
author_email: The email of the author that changed this file
committer_name: The name of the committer
committer_email: The email of the committer
committed_date: The committed date of the commit
authored_date: The authored date of the commit
file_path: The path to this file in the repository
previous_file_path: The path to this file before it has been touched
file_hash: The name of the related workflow file in the dataset
previous_file_hash: The name of the related workflow file in the dataset, before it has been touched
git_change_type: A single letter (A, D, M, or R) representing the type of change made to the workflow (Added, Deleted, Modified, or Renamed). This letter is given by gitpython and provided as is.
valid_yaml: A boolean indicating if the file is a valid YAML file.
probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)
valid_workflow: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
uid: A unique identifier for a given file that survives modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renaming does not change the identifier.
Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
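As an illustration of how these columns might be used (this is not part of the replication package, and it assumes workflows.csv has already been decompressed as described above), per-file histories can be reconstructed by grouping on uid:

import pandas as pd

meta = pd.read_csv("workflows.csv")

# Keep only entries that are syntactically valid GitHub Actions workflows.
valid = meta[meta["valid_workflow"]]

# Reconstruct per-file histories: group by the stable identifier uid and
# order each group's snapshots by commit date.
histories = (
    valid.sort_values("committed_date")
         .groupby("uid")[["repository", "file_path", "git_change_type",
                          "committed_date", "file_hash"]]
)

# Example: number of recorded changes per workflow file.
changes_per_file = histories.size().sort_values(ascending=False)
print(changes_per_file.head())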
https://creativecommons.org/publicdomain/zero/1.0/
http://opendatacommons.org/licenses/dbcl/1.0/
#https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
#################################
###Variables for downloaded files
data.dir <- ' '
train.file <- paste0(data.dir, 'training.csv')
test.file <- paste0(data.dir, 'test.csv')
#################################
###Load csv -- creates a data.frame where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = F)
d.test <- read.csv(test.file, stringsAsFactors = F)
###In training.csv, we have 7049 rows, each one with 31 columns.
###The first 30 columns are keypoint locations, which R correctly identified as numbers.
###The last one is a string representation of the image, identified as a string.
###To look at samples of the data, uncomment this line:
#head(d.train)
###Let's save the Image column as another variable, and remove it from d.train:
###d.train is our dataframe, and we want the column called Image.
###Assigning NULL to a column removes it from the dataframe.
im.train <- d.train$Image
d.train$Image <- NULL #removes 'Image' from the dataframe
im.test <- d.test$Image
d.test$Image <- NULL #removes 'Image' from the dataframe
#################################
#The image is represented as a series of numbers, stored as a string.
#Convert these strings to integers by splitting them and converting the result to integers.
#strsplit splits the string
#unlist simplifies its output to a vector of strings
#as.integer converts it to a vector of integers
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))
###Install and activate the appropriate libraries.
###The tutorial is meant for Linux and OS X, where a parallel backend library is used, so on Windows:
###Replace all instances of %dopar% with %do%.
library("foreach", lib.loc="~/R/win-library/3.3")
###Convert all images (with %do% this runs sequentially)
im.train <- foreach(im = im.train, .combine=rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
im.test <- foreach(im = im.test, .combine=rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
#The foreach loop evaluates the inner command for each image and combines the results with rbind (combine by rows).
#%do% evaluates the iterations sequentially; %dopar% (with a registered parallel backend) would run them in parallel.
#im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel).
###Save all four variables in a data.Rd file.
###They can be reloaded at any time with load('data.Rd').
save(d.train, im.train, d.test, im.test, file='data.Rd')
#Each image is a vector of 96*96 pixels (96*96 = 9216).
#Convert these 9216 integers into a 96x96 matrix:
im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)
#im.train[1,] returns the first row of im.train, which corresponds to the first training image.
#rev reverses the resulting vector to match the interpretation of R's image function
#(which expects the origin to be in the lower left corner).
#To visualize the image we use R's image function:
image(1:96, 1:96, im, col=gray((0:255)/255))
#Let's color the coordinates of the eyes and nose in the first image:
points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red")
points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue")
points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")
#Another good check is to see how variable our data is.
#For example, where are the centers of the noses in the 7049 images? (this takes a while to run):
for(i in 1:nrow(d.train)) {
  points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red")
}
#There are quite a few outliers -- they could be labeling errors. Looking at one extreme example:
#In this case there is no labeling error, but it shows that not all faces are centralized.
idx <- which.max(d.train$nose_tip_x)
im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96)
image(1:96, 1:96, im, col=gray((0:255)/255))
points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")
#One of the simplest things to try is to compute the mean of the coordinates of each keypoint
#in the training set and use that as a prediction for all images:
colMeans(d.train, na.rm=T)
#To build a submission file we need to apply these computed coordinates to the test instances:
p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)
#The expected submission format has one keypoint per row, but we can easily get that
#with the help of the reshape2 library:
library(reshape2)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All extracted data files are in the data folder.