Version 5 release notes:
- Removes support for SPSS and Excel data.
- Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
- Adds in agencies that report 0 months of the year.
- Adds a column that indicates the number of months reported. This is generated by summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime. They may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
- Removes data on runaways.
Version 4 release notes:
- Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes the names of the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
- Add data for 2016.
- Order rows by year (descending) and ORI.

Version 2 release notes:
Fix bug where Philadelphia Police Department had incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains highly granular counts of the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value other than "None/not reported" when the agency reports zero arrests. In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
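To make this recoding concrete, here is a minimal R sketch of the kind of cleaning described above (this is not the author's actual code from the repository linked above, and the arrest column names passed in are hypothetical placeholders):

# Sketch of the recoding described above; column names are placeholders.
clean_arrests <- function(df, arrest_cols) {
  junk_values <- c(seq(10000, 100000, by = 10000), 99999, 99998)
  for (col in arrest_cols) {
    x <- df[[col]]
    x[which(x == "None/not reported")] <- 0   # assume "None/not reported" means zero arrests
    x <- suppressWarnings(as.numeric(x))
    x[x %in% junk_values] <- NA               # implausible reported counts set to missing
    df[[col]] <- x
  }
  df
}
# Example usage with hypothetical column names:
# ucr <- clean_arrests(ucr, arrest_cols = c("murder_tot_male", "murder_tot_female"))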
To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.
To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.
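If you read the file in R rather than Excel, you can preserve those leading zeros by forcing the FIPS columns to be read as character. A minimal sketch, where the file name and column name are hypothetical placeholders:

# Read FIPS codes as character so leading zeros are kept (names are placeholders)
ucr <- read.csv("ucr_index_crimes.csv",
                colClasses = c(fips_state_county_code = "character"))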
I created 9 arrest categories myself. The categories are:
- Total Male Juvenile
- Total Female Juvenile
- Total Male Adult
- Total Female Adult
- Total Male
- Total Female
- Total Juvenile
- Total Adult
- Total Arrests

All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.
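As an illustration of how such totals can be derived from the sex-age columns, here is a small R sketch; the murder column names used for matching are hypothetical placeholders, not the actual abbreviated names in the data:

# Hypothetical column names; the real data uses abbreviated, Stata-safe names.
male_age_cols   <- grep("^murder_male_",   names(ucr), value = TRUE)
female_age_cols <- grep("^murder_female_", names(ucr), value = TRUE)
ucr$murder_tot_male    <- rowSums(ucr[, male_age_cols, drop = FALSE])
ucr$murder_tot_female  <- rowSums(ucr[, female_age_cols, drop = FALSE])
ucr$murder_tot_arrests <- ucr$murder_tot_male + ucr$murder_tot_female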
As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight of which contain different crimes, plus the "simple" file. Each file contains the data for all years. The eight categories each contain crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Because Stata limits column names to a maximum of 32 characters, I have abbreviated the crime names in the data. The files and their included crimes are:
Index Crimes
- Murder
- Rape
- Robbery
- Aggravated Assault
- Burglary
- Theft
- Motor Vehicle Theft
- Arson

Alcohol Crimes
- DUI
- Drunkenness
- Liquor

Drug Crimes
- Total Drug
- Total Drug Sales
- Total Drug Possession
- Cannabis Possession
- Cannabis Sales
- Heroin or Cocaine Possession
- Heroin or Cocaine Sales
- Other Drug Possession
- Other Drug Sales
- Synthetic Narcotic Possession
- Synthetic Narcotic Sales

Grey Collar and Property Crimes
- Forgery
- Fraud
- Stolen Property

Financial Crimes
- Embezzlement
- Total Gambling
- Other Gambling
- Bookmaking
- Numbers Lottery

Sex or Family Crimes
- Offenses Against the Family and Children
- Other Sex Offenses
- Prostitution
- Rape

Violent Crimes
- Aggravated Assault
- Murder
- Negligent Manslaughter
- Robbery
- Weapon Offenses

Other Crimes
- Curfew
- Disorderly Conduct
- Other Non-traffic
- Suspicion
- Vandalism
- Vagrancy
Simple
This data set has every crime and only the arrest categories that I created (see above).
If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
https://creativecommons.org/publicdomain/zero/1.0/
The Comprehensive R Archive Network (CRAN) is the central repository for software packages in the powerful R programming language for statistical computing. It describes itself as "a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R." If you're installing an R package in the standard way then it is provided by one of the CRAN mirrors.
The ecosystem of R packages continues to grow at an accelerated pace, covering a multitude of aspects of statistics, machine learning, data visualisation, and many other areas. This dataset provides monthly updates of all the packages available through CRAN, as well as their release histories. Explore the evolution of the R multiverse and all of its facets through this comprehensive data.
I'm providing two CSV tables that describe the current set of R packages on CRAN, as well as the version history of these packages. To derive the data, I made use of the fantastic functionality of the tools package, via the CRAN_package_db function, and the equally wonderful packageRank package and its packageHistory function. The results from those functions were slightly adjusted and formatted. I might add further related tables over time.
See the associated blog post for how the data was derived, and for some ideas on how to explore this dataset.
These are the tables contained in this dataset:
cran_package_overview.csv: all R packages currently available through CRAN, with (usually) 1 row per package. (At the time of the creation of this Kaggle dataset there were a few packages with 2 entries and different dependencies. Feel free to contribute some EDA investigating those.) Packages are listed in alphabetical order according to their names.
cran_package_history.csv: version history of virtually all packages in the previous table. This table has one row for each combination of package name and version number, which in most cases leads to multiple rows per package. Packages are listed in alphabetical order according to their names.
I will update this dataset on a roughly monthly cadence by checking which packages have a newer version than the one in the overview table, and then replacing the outdated entries.
Table cran_package_overview.csv: I decided to simplify the large number of columns provided by CRAN and tools::CRAN_package_db into a smaller set of more focused features. All columns are formatted as strings, except for the boolean feature needs_compilation; the date_published column can be read as a ymd date:
- package: package name following the official spelling and capitalisation. The table is sorted alphabetically according to this column.
- version: current version.
- depends: which other packages this package depends on.
- imports: which other packages this package imports.
- licence: the licence under which the package is distributed (e.g. GPL versions).
- needs_compilation: boolean feature describing whether the package needs to be compiled.
- author: package author.
- bug_reports: where to send bugs.
- url: where to read more.
- date_published: when the current version of the package was published. Note: this is not the date of the initial package release. See the package history table for that.
- description: relatively detailed description of what the package is doing.
- title: the title and tagline of the package.

Table cran_package_history.csv: The output of packageRank::packageHistory for each package from the overview table. Almost all packages have a match in this table and can be matched by package and version. All columns are strings, and the date can again be parsed as a ymd date:

- package: package name. Joins to the feature of the same name in the overview table. The table is sorted alphabetically according to this column.
- version: historical or current package version. Also joins. Secondary sorting column within each package name.
- date: when this version was published. Should sort in the same way as the version does.
- repository: on CRAN or in the Archive.

All data is being made publicly available by the Comprehensive R Archive Network (CRAN). I'm grateful to the authors and maintainers of the packages tools and packageRank for providing the functionality to query CRAN packages smoothly and easily.
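As a quick-start illustration, the two tables can be read and joined in R roughly as follows (a sketch assuming both CSVs have been downloaded to the working directory and using lubridate's ymd() to parse the dates):

library(lubridate)

# Read the two tables (assuming both CSVs are in the working directory)
overview <- read.csv("cran_package_overview.csv")
history  <- read.csv("cran_package_history.csv")

# The date columns are stored as strings and can be parsed as ymd dates
overview$date_published <- ymd(overview$date_published)
history$date            <- ymd(history$date)

# Match each current release to its entry in the version history
current <- merge(overview, history, by = c("package", "version"))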
The vignette photo is the official logo for the R language © 2016 The R Foundation. You can distribute the logo under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license...
KORUSAQ_Ground_Pandora_Data contains all of the Pandora instrumentation data collected during the KORUS-AQ field study. Contained in this dataset are column measurements of NO2, O3, and HCHO. Pandoras were situated at various ground sites across the study area, including NIER-Taehwa, NIER-Olympic Park, NIER-Gwangju, NIER-Anmyeon, Busan, Yonsei University, Songchon, and Yeoju. Data collection for this product is complete.

The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea’s National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements, including both in-situ and remote sensing instruments.

Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near the surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.
The purpose of this project was to get added practice learning new skills and to demonstrate R data analysis skills. The data set was located on Kaggle and shows sales information from the years 2010 to 2012. The weekly sales have two categories, holiday and non-holiday, represented by 1 and 0 in that column respectively.
The main question for this exercise was: were there any factors that affected weekly sales for the stores? Those factors included temperature, fuel prices, and unemployment rates.
install.packages("tidyverse")
install.packages("dplyr")
install.packages("tsibble")
library("tidyverse")
library(readr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(tsibble)
Walmart <- read.csv("C:/Users/matth/OneDrive/Desktop/Case Study/Walmart.csv")
Checked the column names of the data to verify consistency.
colnames(Walmart)
dim(Walmart)
str(Walmart)
head(Walmart)
which(is.na(Walmart$Date))
sum(is.na(Walmart))
These checks show whether there is any NA data in the set.
Walmart$Store<-as.factor(Walmart$Store)
Walmart$Holiday_Flag<-as.factor(Walmart$Holiday_Flag)
Walmart$week<-yearweek(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y"))) # make sure to install "tsibble"
Walmart$year<-format(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y")),"%Y")
Walmart_Holiday<-
filter(Walmart, Holiday_Flag==1)
Walmart_Non_Holiday<-
filter(Walmart, Holiday_Flag==0)
ggplot(Walmart, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Weekly Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
The boxplot shows that Store 14 had the maximum weekly sales while Store 33 had the minimum.
Let's verify the results via slice_max and slice_min:
Walmart %>% slice_max(Weekly_Sales)
Walmart %>% slice_min(Weekly_Sales)
It looks like the information was correct. Let's check the mean of the Weekly_Sales column:
mean(Walmart$Weekly_Sales)
The mean of the Weekly_Sales column for the Walmart dataset was 1,046,965.
ggplot(Walmart_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Holiday Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
Store 4 had the highest weekly sales during a holiday week based on the boxplot, while stores 33 and 5 had some of the lowest holiday sales. Let's verify again with slice_max and slice_min:
Walmart_Holiday %>% slice_max(Weekly_Sales)
Walmart_Holiday %>% slice_min(Weekly_Sales)
The results match what is shown on the boxplot. Let's find the mean:
mean(Walmart_Holiday$Weekly_Sales)
The mean was 1,122,888.
ggplot(Walmart_Non_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Non Holiday Sales Across 45 Stores', x='Weekly sales', y='Store')+theme_bw()
Let's see whether the results match those from the full Walmart dataset, which includes both holiday and non-holiday weeks. There, Store 14 had the maximum sales and Store 33 had the minimum. Let's verify the results and find the mean:
Walmart_Non_Holiday %>% slice_max(Weekly_Sales)
Walmart_Non_Holiday %>% slice_min(Weekly_Sales)
mean(Walmart_Non_Holiday$Weekly_Sales)
The results matched, and the mean weekly sales figure was 1,041,256.
ggplot(data = Walmart) + geom_point(mapping = aes(x=year, y=Weekly_Sales))
According to the plot, 2010 had the most sales. Let's use a boxplot to see more.
ggplot(Walmart, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Weekly Sales for Years 2010 - 2012',
x='Year', y='Weekly Sales')
2010 saw higher sales numbers and a higher median.
Let's start with holiday weekly sales:
ggplot(Walmart_Holiday, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Holiday Weekly Sales for Years 2010 - 2012', x='Year', y='Weekly Sales')
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values will be extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing limitations of this dataset:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels/(wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

All zip files within this data release contain nested directories using .parquet files to store the data. The example_script_for_using_parquet.R contains example code for using the R arrow package to open and query the nested .parquet files.
- "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by the scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files with the byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset follows: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by the year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset follows: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" – This script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" – This cross-walk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" – This cross-walk file identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" – A map of the Landsat grid tiles labelled by the horizontal-vertical ID.
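As an illustration of the filtering steps suggested above, here is a minimal R sketch using the arrow and dplyr packages. This is not the provided example_script_for_using_parquet.R; it assumes the year_byscene=2023 archive has been extracted into the working directory, and the 10% cloud threshold is only an example value:

library(arrow)
library(dplyr)

# Open the partitioned by-scene data set (tile_hv=... subdirectories on disk)
byscene <- open_dataset("year_byscene=2023")

# Drop cloudy scenes, waterbodies with few water pixels, and non-water deepest points
filtered <- byscene %>%
  mutate(percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)) %>%
  filter(percent_cloud_pixels < 0.1,   # example threshold, adjust as needed
         wb_dswe1_pixels >= 10,
         dp_dswe == 1) %>%
  collect()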
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background and Purpose:
There are very few publicly available datasets on real-world falls in the scientific literature, due to the scarcity of natural falls and the inherent difficulty of gathering biomechanical and physiological data from young subjects or older adults residing in their communities in a non-intrusive and user-friendly manner. This data gap has hindered research on fall prevention strategies. Immersive Virtual Reality (VR) environments provide a unique solution.
This dataset supports research in fall prevention by providing an immersive VR setup that simulates diverse ecological environments and randomized visual disturbances, aimed at triggering and analyzing balance-compensatory reactions. The dataset is a unique tool for studying human balance responses to VR-induced perturbations, facilitating research that could inform training programs, wearable assistive technologies, and VR-based rehabilitation methods.
Dataset Content: The dataset includes:
Kinematic Data: Captured using a full-body Xsens MVN Awinda inertial measurement system, providing detailed movement data at 60 Hz.
Muscle Activity (EMG): Recorded at 1111 Hz using Delsys Trigno for tracking muscle contractions.
Electrodermal Activity (EDA): Captured at 100.21 Hz with a Shimmer GSR device on the dominant forearm to record physiological responses to perturbations.
Metadata: Includes participant demographics (age, height, weight, gender, dominant hand and foot), trial conditions, and perturbation characteristics (timing and type).
The files are named in the format "ParticipantX_labelled", where X represents the participant's number. Each file is provided in a .mat format, with data already synchronized across different sensor sources. The structure of each file is organized into the following columns:
Column 1: Label indicating the visual perturbation applied. 0 means no visual perturbation.
Column 2: Timestamp, providing the precise timing of each recorded data point.
Column 3: Frame identifier, which can be cross-referenced with the MVN file for detailed motion analysis.
Columns 4 to 985: Xsens motion capture features, exported directly from the MVN file.
Columns 986 to 993: EMG data - Tibialis Anterior (R&L), Gastrocnemius Medial Head (R&L), Rectus Femoris (R), Semitendinosus (R), External Oblique (R), Sternocleidomastoid (R).
Columns 994 to 1008: Shimmer data: Accelerometer (x,y,z), Gyroscope (x,y,z), Magnetometer (x,y,z), GSR Range, Skin Conductance, Skin Resistance, PPG, Pressure, Temperature.
In addition, we are also releasing the .MVN and .MVNA files for each participant (1 to 10), which provide comprehensive motion capture data and include the participants' body measurements, respectively. This additional data enables precise body modeling and further in-depth biomechanical analysis.
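For orientation, below is a minimal R sketch for slicing the synchronized matrix by the column ranges listed above. It assumes the .mat files can be read with the R.matlab package and that the labelled data are stored as a single numeric matrix; the element index inside the returned list is an assumption, so inspect the object returned by readMat to confirm:

library(R.matlab)

# Read one participant's synchronized, labelled data (assumptions noted above)
mat  <- readMat("Participant1_labelled.mat")
data <- mat[[1]]                 # assumed: the labelled matrix is the first element

labels  <- data[, 1]             # visual perturbation label (0 = none)
time    <- data[, 2]             # timestamp
frame   <- data[, 3]             # MVN frame identifier
xsens   <- data[, 4:985]         # Xsens motion capture features
emg     <- data[, 986:993]       # Delsys Trigno EMG channels
shimmer <- data[, 994:1008]      # Shimmer accel/gyro/mag/GSR/PPG/pressure/temperature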
Participants & VR Headset:
Twelve healthy young adults (average age: 25.09 ± 2.81 years; height: 167.82 ± 8.40 cm; weight: 64.83 ± 7.77 kg; 6 males, 6 females) participated in this study (Table 1). Participants met the following criteria: i) healthy locomotion, ii) stable postural balance, iii) age ≥ 18 years, and iv) body weight < 135 kg.
Participants were excluded if they: i) had any condition affecting locomotion, ii) had epilepsy, vestibular disorders, or other neurological conditions impacting stability, iii) had undergone recent surgeries impacting mobility, iv) were involved in other experimental studies, v) were under judicial protection or guardianship, or vi) experienced complications using VR headsets (e.g., motion sickness).
All participants provided written informed consent, adhering to the ethical guidelines set by the University of Minho Ethics Committee (CEICVS 063/2021), in compliance with the Declaration of Helsinki and the Oviedo Convention.
To ensure unbiased reactions, participants were kept unaware of the specific protocol details. Visual disturbances were introduced in a random sequence and at various locations, enhancing the unpredictability of the experiment and simulating a naturalistic response.
The VR setup involved an HTC Vive Pro headset with two wirelessly synchronized base stations that tracked participants’ head movements within a 5m x 2.5m area. The base stations adjusted the VR environment’s perspective according to head movements, while controllers were used solely for setup purposes.
Table 1 - Participants' demographic information
Participant Height (cm) Weight (kg) Age Gender Dom. Hand Dom. Foot
1 159 56.5 23 F Right Right
2 157 55.3 28 F Right Right
3 174 67.1 31 M Right Right
4 176 73.8 23 M Right Right
5 158 57.3 23 F Right Right
6 181 70.9 27 M Right Right
7 171 73.3 23 M Right Right
8 159 69.2 28 F Right Right
9 177 57.3 22 M Right Right
10 171 75.5 25 M Right Right
11 163 58.1 23 F Right Right
12 168 63.7 25 F Right Right
Data Collection Methodology:
The experimental protocol was designed to integrate four essential components: (i) precise control over stimuli, (ii) high reproducibility of the experimental conditions, (iii) preservation of ecological validity, and (iv) promotion of real-world learning transfer.
Participant Instructions and Familiarization Trial: Before starting, participants were given specific instructions to (i) seek assistance if they experienced motion sickness, (ii) adjust the VR headset for comfort by modifying the lens distance and headset fit, (iii) stay within the defined virtual play area demarcated by a blue boundary, and (iv) complete a familiarization trial. During this trial, participants were encouraged to explore various virtual environments while performing a sequence of three key movements—walking forward, turning around, and returning to the initial location—without any visual perturbations. This familiarization phase helped participants acclimate to the virtual space in a controlled setting.
Experimental Protocol and Visual Perturbations: Participants were exposed to 11 different types of visual perturbations as outlined in Table 2, applied across a total of 35 unique perturbation variants (Table 3). Each variant involved the same type of perturbation, such as a clockwise Roll Axis Tilt, but varied in intensity (e.g., rotation speed) and was presented in randomized virtual locations. The selection of perturbation types was grounded in existing literature on visual disturbances. This design ensured that participants experienced a diverse range of visual effects in a manner that maintained ecological validity, supporting the potential for generalization to real-world scenarios where visual perturbations might occur spontaneously.
Protocol Flow and Randomized Presentation: Throughout the experimental protocol, each visual perturbation variant was presented three times, and participants engaged repeatedly in the familiarization activities over a nearly one-hour period. These activities—walking forward, turning around, and returning to the starting point—took place in a 5m x 2.5m physical space mirrored in VR, allowing participants to take 7–10 steps before turning. Participants were not informed of the timing or nature of any perturbations, which could occur unpredictably during their forward walk, adding a realistic element of surprise. After each return to the starting point, participants were relocated to a random position within the virtual environment, with the sequence of positions determined by a randomized, computer-generated order.
Table 2 - Visual perturbations' name and parameters (L - Lateral; B - Backward; F - Forward; S - Slip; T - Trip; CW- Clockwise; CCW - Counter-Clockwise)
Perturbation [Fall Category]    Parameters
Roll Axis Tilt - CW [L] [10º, 20º, 30º] during 0.5s
Roll Axis Tilt – CCW [L] [10º, 20º, 30º] during 0.5s
Support Surface ML Axis Translation - Bidirectional [L] Discrete Movement (static pauses between movements) – 1 m/s
AP Axis Translation - Front [F] 1 m/s
AP Axis Translation - Backwards [B] 1 m/s
Pitch Axis Tilt [S] 0º-25º, 60º/s
Virtual object with lower height than a real object [T] Variable object height
Roll-Pitch-Yaw Axis Tilt [Syncope] Sum of sinusoids drives each axis rotation
Scene Object Movement [L] Objects fly towards the subject’s head. Variable speeds
Vertigo Sensation [F/L] Walk at a comfortable speed. With and without avatar. House’s height
Axial Axis Translation [F/B/L] Free fall
Table 3 - Label Encoding
Visual Perturbation Label Visual Perturbation Label Visual Perturbation Label
Roll Indoor 1 CW10 1 Roll Indoor 1 CW20 2 Roll Indoor 1 CW30 3
Roll Indoor 1 CCW10 4 Roll Indoor 1 CCW20 5 Roll Indoor 1 CCW30 6
Roll Indoor 2 CW10 7 Roll Indoor 2 CW20 8 Roll Indoor 2 CW30 9
Roll Indoor 2 CCW10 10 Roll Indoor 2 CCW20 11 Roll Indoor 2 CCW30 12
Roll Outdoor CW10 13 Roll Outdoor CW20 14 Roll Outdoor CW30 15
Roll Outdoor CCW10 16 Roll Outdoor CCW20 17 Roll Outdoor CCW30 18
ML-Axis Trans. - Kitchen 19 AP-Axis Trans. - Corridor Forward 20 AP-Axis Trans. - Corridor Backward 21
Pitch Indoor - Bathroom (wet floor) 22 Pitch Indoor - Near Fridge (wet floor) 23 Roof Beam Walking - Vertigo 24
Roof Beam Walking - Vertigo No Avatar 25 Simple Roof - Vertigo 26 Simple Roof - Vertigo No Avatar 27
Pitch Outdoor - Near Car Oil 28 Trip - Sidewalk / Trip Shock 29/290 Bedroom Syncope 30
Garden - Object Avoidance 31 Electricity Pole - Vertigo 32 Electricity Pole - No Avatar 33
Free Fall 34 Climbing Virtual Stairs 35
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General overview
The following datasets are described by this metadata record, and are available for download from the provided URL.
- Raw log files, physical parameters raw log files
- Raw excel files, respiration/PAM chamber raw excel spreadsheets
- Processed and cleaned excel files, respiration chamber biomass data
- Raw rapid light curve excel files (this is duplicated from Raw log files), combined dataset pH, temperature, oxygen, salinity, velocity for experiment
- Associated R script file for pump cycles of respiration chambers
####
Physical parameters raw log files
Raw log files
1) DATE=
2) Time= UTC+11
3) PROG=Automated program to control sensors and collect data
4) BAT=Amount of battery remaining
5) STEP=check aquation manual
6) SPIES=check aquation manual
7) PAR=Photoactive radiation
8) Levels=check aquation manual
9) Pumps= program for pumps
10) WQM=check aquation manual
####
Respiration/PAM chamber raw excel spreadsheets
Abbreviations in headers of datasets
Note: Two data sets are provided in different formats: raw and cleaned (adj). These are the same data, with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) dataframe will work with the R syntax below; alternatively, add code to do the cleaning in R.
Date: ISO 1986 - Check
Time:UTC+11 unless otherwise stated
DATETIME: UTC+11 unless otherwise stated
ID (of instrument in respiration chambers)
ID43=Pulse amplitude fluorescence measurement of control
ID44=Pulse amplitude fluorescence measurement of acidified chamber
ID=1 Dissolved oxygen
ID=2 Dissolved oxygen
ID3= PAR
ID4= PAR
PAR=Photo active radiation umols
F0=minimal fluorescence from PAM
Fm=Maximum fluorescence from PAM
Yield=(F0 – Fm)/Fm
rChl=an estimate of chlorophyll (Note this is uncalibrated and is an estimate only)
Temp=Temperature degrees C
PAR=Photo active radiation
PAR2= Photo active radiation2
DO=Dissolved oxygen
%Sat= Saturation of dissolved oxygen
Notes=This is the program of the underwater submersible logger with the following abbreviations:
Notes-1) PAM=
Notes-2) PAM=Gain level set (see aquation manual for more detail)
Notes-3) Acclimatisation= Program of slowly introducing treatment water into chamber
Notes-4) Shutter start up 2 sensors+sample…= Shutter PAMs automatic set up procedure (see aquation manual)
Notes-5) Yield step 2=PAM yield measurement and calculation of control
Notes-6) Yield step 5= PAM yield measurement and calculation of acidified
Notes-7) Abatus respiration DO and PAR step 1= Program to measure dissolved oxygen and PAR (see aquation manual). Steps 1-4 are different stages of this program including pump cycles, DO and PAR measurements.
8) Rapid light curve data
Pre LC: A yield measurement prior to the following measurement
After 10.0 sec at 0.5% to 8%: Level of each of the 8 steps of the rapid light curve
Odessey PAR (only in some deployments): An extra measure of PAR (umols) using an Odessey data logger
Dataflow PAR: An extra measure of PAR (umols) using a Dataflow sensor.
PAM PAR: This is copied from the PAR or PAR2 column
PAR all: This is the complete PAR file and should be used
Deployment: Identifying which deployment the data came from
####
Respiration chamber biomass data
The data are chlorophyll a biomass from cores taken from the respiration chambers. The headers are:
Depth (mm)
Treat (Acidified or control)
Chl a (pigment and indicator of biomass)
Core (5 cores were collected from each chamber; three were analysed for chl a)
These are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.
####
Associated R script file for pump cycles of respiration chambers
Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to chambers. Determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the respiration chambers are delivering water as this will give incorrect results. The headers that get used in the attached/associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.
To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180-minute data blocks were determined. R squared values for the fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions in the calculation of net production rates: the first is that heterotrophic community members do not change their metabolism under OA; the second is that the heterotrophic communities are similar between treatments.
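A minimal R sketch of this block-regression idea is given below. It is not the associated R script itself; the column names DO and ElapsedTimeMin are assumptions based on the header descriptions above, while start_regression and end_regression come from the pump-cycle file:

# Fit a linear regression of oxygen against elapsed time within one 180-minute block
block_rate <- function(chamber_data, start_regression, end_regression) {
  block <- chamber_data[chamber_data$ElapsedTimeMin >= start_regression &
                        chamber_data$ElapsedTimeMin <= end_regression, ]
  fit <- lm(DO ~ ElapsedTimeMin, data = block)
  cf  <- unname(coef(fit))
  c(intercept = cf[1],                       # regression intercept
    ElapsedTimeMincoef = cf[2],              # slope, i.e. rate of oxygen change
    r_squared = summary(fit)$r.squared)
}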
####
Combined dataset pH, temperature, oxygen, salinity, velocity for experiment
This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).
The headers are
PAR: Photoactive radiation
relETR: F0/Fm x PAR
Notes: Stage/step of light curve
Treatment: Acidified or control
The associated light treatments in each stage are listed below. Each actinic light intensity is held for 10 seconds, then a saturating pulse is applied (see PAM methods).
After 10.0 sec at 0.5% = 1 umols PAR
After 10.0 sec at 0.7% = 1 umols PAR
After 10.0 sec at 1.1% = 0.96 umols PAR
After 10.0 sec at 1.6% = 4.32 umols PAR
After 10.0 sec at 2.4% = 4.32 umols PAR
After 10.0 sec at 3.6% = 8.31 umols PAR
After 10.0 sec at 5.3% =15.78 umols PAR
After 10.0 sec at 8.0% = 25.75 umols PAR
This dataset appears to be missing data; note that the D5 rows contain potentially unusable information.
See the word document in the download file for more information.
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).

Resources in this dataset:

- Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format (PBMC7_AllCells.zip): Zipped folder containing the PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gz), gene names (features.tsv.gz), cell IDs (barcodes.tsv.gz). *The 'raw' count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix, but they should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata (PBMC7_AllCells_meta.csv): .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell; nFeature_RNA = the number of genes detected in a cell; Loupe = cell barcodes, corresponding to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells; prcntMito = percent mitochondrial reads in a cell; Scrublet = doublet probability score assigned to a cell; seurat_clusters = cluster ID assigned to a cell; PaperIDs = sample ID for a cell; celltypes = cell type ID assigned to a cell.
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates (PBMC7_AllCells_PCAcoord.csv): .csv file containing the first 100 PCA coordinates for cells.
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates (PBMC7_AllCells_tSNEcoord.csv): .csv file containing t-SNE coordinates for all cells.
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates (PBMC7_AllCells_UMAPcoord.csv): .csv file containing UMAP coordinates for all cells.
- Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates (PBMC7_CD4only_tSNEcoord.csv): .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in the publication can be re-assigned using this .csv file.
- Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates (PBMC7_CD4only_UMAPcoord.csv): .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in the publication can be re-assigned using this .csv file.
- Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates (PBMC7_GDonly_UMAPcoord.csv): .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in the publication can be re-assigned using this .csv file.
- Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates (PBMC7_GDonly_tSNEcoord.csv): .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in the publication can be re-assigned using this .csv file.
- Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information (UnfilteredGeneInfo.txt): .txt file containing gene nomenclature information used to assign gene names in the dataset. The 'Name' column corresponds to the name assigned to a feature in the dataset.
- Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat (PBMC7.tar): .h5Seurat object of all cells in the PBMC dataset. The file needs to be untarred, then read into R using the function LoadH5Seurat().
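For example, the counts and metadata can be loaded into a Seurat object roughly as follows (a sketch assuming PBMC7_AllCells.zip has been extracted to a local PBMC7_AllCells/ folder and the metadata .csv is in the working directory):

library(Seurat)

# Load the 10X-formatted counts and build a Seurat object
counts <- Read10X("PBMC7_AllCells/")
pbmc   <- CreateSeuratObject(counts)

# Attach the provided per-cell metadata; the Loupe column holds the cell barcodes
meta <- read.csv("PBMC7_AllCells_meta.csv")
rownames(meta) <- meta$Loupe
pbmc <- AddMetaData(pbmc, metadata = meta)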
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains all the scripts and data used in the analysis of the LTMP data presented in the manuscript "Longer time series with missing data improve parameter estimation in State-Space models in coral reef fish communities". There are 22 files in total.

All model fits were run on the HPC cluster at James Cook University. The model fit to the 11-year time series took approximately 3-5 days and the model fit to the 25-year time series took approximately 10-12 days. We did not include the model fits as they are big files (~12-30 GB), but these can be obtained by running the corresponding scripts.

LTMP data and data wrangling
- LTMP_data_1995_2005_prop_zero_40sp.RData: File containing 45 columns. The first column is Year and it contains the year for each observation in the dataset. The second column, Reef, contains the reef name, while the latitude and longitude are collected in the third column, Reef_lat, and the fourth column, Reef_long, respectively. The fifth column is called Shelf and contains the reef shelf position: I for inner shelf positioning, M for middle shelf positioning, and O for outer shelf positioning. The rest of the columns contain the counts of the 40 species with the lowest proportion of zeros in the LTMP data. This contains data from 1995 to 2005.
- LTMP_data_1995_2019_prop_zero_40sp.RData: Same data structure as above but for the time series from 1995 to 2019 (includes NAs in some of the abundance counts).
- dw_11y_Pomacentrids.R and dw_25yNA_Pomacentrids.R: Scripts that order species into pomacentrids and non-pomacentrids so the models can be fitted to the data. These files produce the data files LTMP_data_1995_2005_prop_zero_40sp_Pomacentrids.RData and LTMP_data_1995_2019_prop_zero_40sp_PomacentridsNA.RData.

Model fitting
- LTMP_fit_40sp.R: Script that fits the model to the 11-year time series data. Specifically, the input dataset is LTMP_data_1995_2005_prop_zero_40sp_Pomacentrids.RData and the output fit is called LTMP_fit_40sp.RData.
- LTMP_fit_40sp_NA.R: Script that fits the model to the 25-year time series with missing data. Specifically, the input dataset is LTMP_data_1995_2019_prop_zero_40sp_PomacentridsNA.RData and the output fit is called LTMP_fit_40sp_NA.RData.

Stan model
- MARPLN_LV_Pomacentrids.stan: Stan code for the multivariate autoregressive Poisson-Lognormal model with the latent variables.
- MARPLN_LV_Pomacentrids_NA.stan: Stan code for the same model as above, but able to deal with missing data.

Figures
- Figure 1 A and B.R and Figure 4.R produce the corresponding figures in the main text. Note that Figure 1 A and B.R requires several files to produce the GBR and Australia maps. These are: Great_Barrier_Reef_Features.cpg, Great_Barrier_Reef_Features.dbf, Great_Barrier_Reef_Features.lyr, Great_Barrier_Reef_Features.shp.xml, Reef_lat_long.csv, Great_Barrier_Reef_Features.prj, Great_Barrier_Reef_Features.sbn, Great_Barrier_Reef_Features.sbx, Great_Barrier_Reef_Features.shp, Great_Barrier_Reef_Features.shx
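A small R sketch for inspecting the 11-year data file is shown below; since the name of the object stored inside the .RData file is not documented here, it is loaded into a fresh environment:

# Load the 11-year data file into a fresh environment and separate the columns
e <- new.env()
load("LTMP_data_1995_2005_prop_zero_40sp.RData", envir = e)
ltmp <- get(ls(e)[1], envir = e)          # the single object stored in the file

id_cols      <- c("Year", "Reef", "Reef_lat", "Reef_long", "Shelf")
species_cols <- setdiff(names(ltmp), id_cols)   # the 40 species count columns
str(ltmp[, id_cols])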
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and results from the Imageomics Workflow. These include data files from the Fish-AIR repository (https://fishair.org/) for purposes of reproducibility and outputs from the application-specific imageomics workflow contained in the Minnow_Segmented_Traits repository (https://github.com/hdr-bgnn/Minnow_Segmented_Traits).
Fish-AIR: This is the dataset downloaded from Fish-AIR, filtering for Cyprinidae and the Great Lakes Invasive Network (GLIN) from the Illinois Natural History Survey (INHS) dataset. These files contain information about fish images, fish image quality, and the path for downloading the images. The data download ARK ID is dtspz368c00q (2023-04-05). The following files are unaltered from the Fish-AIR download. We use the following files:
extendedImageMetadata.csv: A CSV file containing information about each image file. It has the following columns: ARKID, fileNameAsDelivered, format, createDate, metadataDate, size, width, height, license, publisher, ownerInstitutionCode. Column definitions are defined https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.
imageQualityMetadata.csv: A CSV file containing information about the quality of each image. It has the following columns: ARKID, license, publisher, ownerInstitutionCode, createDate, metadataDate, specimenQuantity, containsScaleBar, containsLabel, accessionNumberValidity, containsBarcode, containsColorBar, nonSpecimenObjects, partsOverlapping, specimenAngle, specimenView, specimenCurved, partsMissing, allPartsVisible, partsFolded, brightness, uniformBackground, onFocus, colorIssue, quality, resourceCreationTechnique. Column definitions are defined https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.
multimedia.csv: A CSV file containing information about image downloads. It has the following columns: ARKID, parentARKID, accessURI, createDate, modifyDate, fileNameAsDelivered, format, scientificName, genus, family, batchARKID, batchName, license, source, ownerInstitutionCode. Column definitions are defined https://fishair.org/vocabulary.html and the persistent column identifiers are in the meta.xml file.
meta.xml: An XML file with the metadata about the column indices and URIs for each file contained in the original downloaded zip file. This file is used in the fish-air.R script to extract the indices for column headers.
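For reference, the three Fish-AIR tables can be combined in R on their shared ARKID column roughly as follows (a sketch assuming the CSVs from the download are in the working directory):

library(dplyr)

# Read the three tables and join them on ARKID
multimedia <- read.csv("multimedia.csv")
image_meta <- read.csv("extendedImageMetadata.csv")
quality    <- read.csv("imageQualityMetadata.csv")

images <- multimedia %>%
  left_join(image_meta, by = "ARKID", suffix = c("", ".img")) %>%
  left_join(quality,    by = "ARKID", suffix = c("", ".quality"))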
The outputs from the Minnow_Segmented_Traits workflow are:
sampling.df.seg.csv: Table with tallies of the sampling of image data per species during the data cleaning and data analysis. This is used in Table S1 in Balk et al.
presence.absence.matrix.csv: The Presence-Absence matrix from segmentation, not cleaned. This is the result of the combined outputs from the presence.json files created by the rule “create_morphological_analysis”. The cleaned version of this matrix is shown as Table S3 in Balk et al.
heatmap.avg.blob.png and heatmap.sd.blob.png: Heatmaps of average area of biggest blob per trait (heatmap.avg.blob.png) and standard deviation of area of biggest blob per trait (heatmap.sd.blob.png). These images are also in Figure S3 of Balk et al.
minnow.filtered.from.iqm.csv: Fish image data set after filtering (see methods in Balk et al. for filter categories).
burress.minnow.sp.filtered.from.iqm.csv: Fish image data set after filtering and selecting species from Burress et al. 2017.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises all trace element measurements made on suspended particulate samples collected by filtration from GO-Flo bottles (towed fish data in a separate dataset) during the US GEOTRACES GP16 cruise, R/V Thomas G Thompson 303. It combines data produced at Bigelow Laboratory (upper water column) and at Rutgers University (intermediate and deep water column). NOTE: At station 18, near-plume depths were sampled in two separated casts deployed at two different times during the long station. The results are interleaved in this data file, but can be distinguished and separated by using sequential GEOTRACES numbers (always every other number since only every other bottle was used for particle sampling).
This dataset contains total and labile particulate element concentrations data via Inductively-coupled plasma mass spectrometry.
Suspended particles were collected from GO-Flo bottles (towed fish data are in a separate dataset) by filtration onto 0.45 um Supor (Pall Gellman) polyethersulfone filters. Particulate matter on the filters was completely digested in hot acids and the resulting solutions were analyzed using inductively-coupled plasma mass spectrometry (ICP-MS). Labile (weak acid leachable) particulate element concentrations are also reported for the samples analyzed at Bigelow Labs.
Parameter names, definitions and Units:
Concentrations of total suspended particulate trace elements are indicated as the element symbol alone: Al, Ba, Cd, (Ce), Co, (Cr), Cu, Fe, La, Mn, (Nd), Ni, P, Pb, (Sc), Th, Ti, V, Y, Zn (elements in parentheses analyzed at Rutgers only). Concentrations of the labile fraction of these particulate elements are indicated as element names followed by the suffix ‘_L’. Volumetric concentrations and concentration errors for all particulate fractions (>0.45 um) are reported in units of pmol/L. The GEOTRACES sample number is in column ‘GEOTRC_SAMPNO’, and the analytical lab is found in column ‘Lab’ (where 1= Bigelow, 2=Rutgers).
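As an illustration of this naming convention, a short R sketch for separating the labile, uncertainty, and lab-specific columns is given below; the file name is a hypothetical placeholder for the exported data table:

# "gp16_particulate_trace_elements.csv" is a placeholder file name
particles <- read.csv("gp16_particulate_trace_elements.csv")

labile_cols <- grep("_L$",   names(particles), value = TRUE)  # labile fractions
error_cols  <- grep("_err$", names(particles), value = TRUE)  # reported uncertainties

bigelow <- subset(particles, Lab == 1)   # upper water column (Bigelow)
rutgers <- subset(particles, Lab == 2)   # intermediate/deep water column (Rutgers)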
Several elements were determined but are not reported here because data quality was compromised by large seasalt or process blank corrections, or because data quality simply has not been fully evaluated yet (REEs at Rutgers). However, it is possible that there is some future utility in these data, and we ask that users contact Bigelow and Rutgers who can provide the data on an individual basis. The analyzed but excluded elements include Mo, Sc, and Sr for Bigelow, and Ca, Mo, Rb, Sr, and the remaining REEs Pr, Sm, Eu, Tb, Gd, Dy, Ho, Er, Tm, Yb, and Lu for Rutgers. We expect that these REEs will be reported in a future iteration of this data report.
Uncertainties (errors) for both total and labile element concentrations are reported with the additional suffix “_Error”. (N.B. DMO changed to "_err" for consistency.)
For samples (station/depth combinations) lacking replicate analyses (i.e. most samples; flag code generally ‘2’—see below for flag code descriptions), uncertainties are reported as propagated errors accounting for each step of analysis and data processing, as described below.
Access restrictions:
It is our understanding that access to the data will be restricted to the GEOTRACES Data Management Committee (DMC) and Standards and Intercalibration Committee (SIC) until the product is publicly released in the next Intermediate Data Product, currently August, 2017.
Related files:
Blanks and Detection Limits and related derived quantities from Rutgers and Bigelow labs (png)
Median process blanks for two labs compared (pmol for 1/2 filter digested)(png)
Median percent process blank correction for reported data from two labs (png)
Total elemental recoveries for certified reference materials (CRMs) reported by each participating lab (png)
Original XL file.
Intercalibration for suspended particulate elements between Sherrell (Rutgers) and Twining (Bigelow) groups (docx)
Duplicated Depths by Lab.png
Duplicated Depths by Cast (Shallow-Intermediate).png
Deplicated Depths by Cast (Intermediate-Deep).png
Original dataset - not separated into Bottle and Fish
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the input files required for the R code used to analyse data for the Patterns and prevalence of food allergy in adulthood in the UK (PAFA) project. This includes:
- pafa_data_dictionary_anonymised.csv: The data dictionary describing each column in the anonymised PAFA dataset. "snomed_field_name" lists all column names in the dataset; "field_name_extended" lists the original column name in the REDCap data download, which was then recoded to include SNOMED and FoodEx2 codes for future analyses; "variable_field_name" denotes the corresponding coded field name in the REDCap form; "field_type" denotes the type of REDCap field; "field_label" describes the field name in plain language; "choices_calculations_or_slider_labels" describes the choices provided to the participant for that question.
- foodex2_codes_with_other.csv: A CSV file with key-value pairs for identifying foods coded in the dataset.
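A minimal R sketch for loading these inputs is given below; the internal structure of foodex2_codes_with_other.csv (beyond being key-value pairs) is not documented here, so it is only inspected:

# Load the data dictionary and the FoodEx2 code table
dict  <- read.csv("pafa_data_dictionary_anonymised.csv")
codes <- read.csv("foodex2_codes_with_other.csv")

# Look up the plain-language label for each coded column
head(dict[, c("snomed_field_name", "field_label")])
head(codes)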
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format (Apache Parquet) with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl

# Load the entire dataset through the Hugging Face Datasets library
RFSD = load_dataset('irlspbru/RFSD')

# Or read a single yearly partition directly from the Hub with polars
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within a year, meaning that streaming the first n rows will not yield a random sample.
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl

RFSD = ds.dataset("local/path/to/RFSD")   # point pyarrow at the local Parquet files
print(RFSD.schema)                        # inspect the schema

RFSD_full = pl.from_arrow(RFSD.to_table())                                  # full dataset
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))   # one year only
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']      # firm identifier and revenue line
    )
)

# Apply the descriptive column names from the provided renaming table
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing the data in R requires the arrow package.
library(arrow)
library(data.table)

# Lazily open the partitioned Parquet dataset and inspect the schema
RFSD <- open_dataset("local/path/to/RFSD")
schema(RFSD)

# Read the full dataset into a data.table (memory-intensive)
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Read a single year
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Read only selected columns for a single year
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Apply human-readable column names from the supplied dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free to use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, and Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove such filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode the structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded to the house level in 2014 and 2021-2023, but only to the street level for 2015-2020 due to improper handling of the house number by Nominatim; in that case we fell back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. For instance, Rosneft has 62 field offices in addition to its central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. Although we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have continued to submit corrections. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
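As a minimal sketch, assuming the data were imported into a data.table named RFSD_full as in the R example above and that outlier is stored as a logical flag:

RFSD_full[, .N, by = outlier]               # how many observations are flagged
RFSD_clean <- RFSD_full[outlier == FALSE]   # keep only unflagged firms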
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups from this data. Gazprom, for instance, had over 800 affiliated entities, and to study this corporate group in its entirety it is not enough to consider the financials of the parent company alone.
Why is the data not in CSV?
The data is provided in the Apache Parquet format: a structured, column-oriented, compressed binary format that allows for conditional subsetting of columns and rows. In other words, you can easily query the financials of the companies of interest while keeping only the variables of interest in memory, greatly reducing the data footprint.
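If you do need a flat file, a minimal R sketch (reusing the local path from the import example above; the output file name is arbitrary) exports one year and two columns to CSV with the arrow and dplyr packages:

library(arrow)
library(dplyr)

RFSD <- open_dataset("local/path/to/RFSD")   # lazily open the partitioned dataset
rfsd_2023_subset <- RFSD |>
  filter(year == 2023) |>                    # read only the 2023 partition
  select(inn, line_2110) |>                  # keep only the variables of interest
  collect()                                  # materialise in memory
write_csv_arrow(rfsd_2023_subset, "RFSD_2023_subset.csv")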
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words once most firms have filed their statements with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. There is therefore a trade-off between data completeness and the timely availability of a new version. We find it a reasonable compromise to query new data in early June, since on average 96.7% of statements are already filed by the end of May, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title = {{R}ussian {F}inancial {S}tatements {D}atabase},
  author = {Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note = {arXiv preprint arXiv:2501.05841},
  doi = {https://doi.org/10.48550/arXiv.2501.05841},
  year = {2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DATA TABLES FOR RUNNING THE SCRIPTS
"Dataset.csv": this file contains the ASV table refined from the dataset generated by Tournayre et al. 2025. The refinement process involved retaining 1) only the 2022 data, and 2) only birds (excluding exotic caged birds such as budgerigars, cockatiels, and grey parrots), mammals (including domestic ones), insects, non-algal plants, and fungi (except for two phyla represented by three ASVs that were rarely detected). The columns "Supergroup", "group", "phylum", "class", "Order", "Family", "Genus", and "species" provide the detailed taxonomy of each ASV. The column "FINAL_ID" indicates the final ASV identification at the most precise taxonomic level, using the following prefixes: "c:" for class, "p:" for phylum, "o:" for order, "f:" for family, "g:" for genus, and "s:" for species. Sample names are coded as follows: the first four letters represent the site, followed by the sampling year and month. For example, "AUCH_2022_apr" corresponds to a sample collected at Auchencorth Moss in April 2022 (a short parsing sketch follows this list).
"Dataset_no-domestic-mammals.csv": the same dataset as "Dataset.csv" described above, minus the domestic mammals: rabbit, sheep, horse, donkey, cat, dog, goat and pig.
"CorineMetal.csv": this dataset contains CORINE Land Cover 2018 surface data (Cole et al., 2021) and heavy metal pollutant concentrations (© Crown 2025) used to perform the land cover – pollution PCA. Land cover categories were aggregated into five main types based on the CORINE land cover classification: Artificial surfaces (codes: 111, 112, 121–124, 131–133, 141, 142), Agricultural areas (codes: 211, 222, 231, 242, 243), Forest and semi-natural areas (codes: 311–313, 321–324, 331, 333), Wetlands (codes: 411, 412, 421, 423), and Water bodies (codes: 511, 512, 522, 523). Pollutant data for 2022 were obtained from the DEFRA UK AIR Information Resource (https://uk-air.defra.gov.uk/data/) (© Crown 2025). These include the average concentrations (ng/m³) of metals primarily emitted from fossil fuel combustion and industrial processes: arsenic (As), cadmium (Cd), cobalt (Co), copper (Cu), iron (Fe), manganese (Mn), nickel (Ni), lead (Pb), selenium (Se), vanadium (V), and zinc (Zn). The dataset provides site-averaged values for each metal. The "Environment_type" column specifies the environment type of the UK Heavy Metal air quality monitoring sites, as defined on the UK AIR website (https://uk-air.defra.gov.uk/networks/site-types).
"hfp2013_merisINT.tif": a raster file used to extract average values from the 2013 Human Footprint maps (Williams et al., 2020).
"BetaComp_output_nodomestic.csv": a dataset containing beta diversity component outputs (total beta diversity "BDTotal", turnover, and nestedness) calculated for each site and for urban vs. rural environments using the beta.div.comp function (coeff = "BJ", Baselga Jaccard-based) from the adespatial R package (Dray et al., 2024). This dataset is used to generate Figure S3. Note that it excludes domestic mammal species.
"Site_coordinates_info.csv": a table providing site details, including: full site name ("Site"), site code ("SITE_ID"), UK-AIR Station ID ("UK-AIR ID"), geographic coordinates ("Latitude", "Longitude"), label position adjustments for visualization ("nudge_x", "nudge_y"), and environment type (Urban vs. Rural) based on the UK AIR Information Resource.
R SCRIPTS TO RUN THE ANALYSES
"Analyses.R": this script performs the alpha and beta diversity analyses. Note that plots for different taxonomic groups were generated separately and later manually combined into the final figures.
"Make_The_Map.R": this script generates the map used in the top panel of Figure 1.
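A minimal base R sketch of the sample-name convention described above; the sample names used here are illustrative placeholders:

sample_names <- c("AUCH_2022_apr", "AUCH_2022_may")   # illustrative sample names
parts <- strsplit(sample_names, "_")
sample_info <- data.frame(
  sample = sample_names,
  site   = sapply(parts, `[`, 1),               # four-letter site code, e.g. "AUCH"
  year   = as.integer(sapply(parts, `[`, 2)),   # sampling year
  month  = sapply(parts, `[`, 3)                # abbreviated month, e.g. "apr"
)
sample_info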
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the data and the R and Stan scripts used to implement the analyses in the manuscript “High response diversity and conspecific density-dependence, not species interactions, drive dynamics of coral reef fish communities”. There are 27 files in total. We did not include the model fits as they are big files (~15-17 GB). We ran the model fits on the HPC at James Cook University.
LTMP data and data wrangling
LTMP_data_1995_2005_prop_zero_20sp.RData: a file containing 25 columns. The first column, Year, contains the year of each observation in the dataset. The second column, Reef, contains the reef name, while the latitude and longitude are stored in the third and fourth columns, Reef_lat and Reef_long, respectively. The fifth column, Shelf, contains the reef shelf position: I for inner, M for middle, and O for outer shelf positioning. The remaining columns contain the counts of the 20 species with the lowest proportion of zeros in the LTMP data.
LTMP_data_1995_2005_prop_zero_40sp.Rdata: a file containing 45 columns. The data structure is the same as the file above, but it includes the abundances of the 40 species with the lowest proportion of zeros in the LTMP data.
dw_Pomacentrids_20sp.R and dw_Pomacentrids_40sp.R: scripts that order species into pomacentrids and non-pomacentrids so the models can be fitted to the data. These files produce the data files LTMP_data_1995_2005_prop_zero_20sp_Pomacentrids.Rdata and LTMP_data_1995_2005_prop_zero_40sp_Pomacentrids.Rdata.
Model fitting
LTMP_fit_40sp_D2.R, LTMP_fit_40sp_D4.R, LTMP_fit_40sp_D6.R, LTMP_fit_40sp_D8.R, LTMP_fit_40sp_D10.R and LTMP_fit_40sp_D12.R: scripts that fit the models with 2, 4, 6, 8, 10 and 12 latent variables to the LTMP dataset with 40 species.
LTMP_fit_20sp.R: a script that fits the model (2 latent variables) to the LTMP dataset with 20 species.
LTMP_fit_20sp_randReef_prior.R: a script that fits the model (2 latent variables) to the LTMP dataset with 20 species and also includes random effects at the reef level.
Cross-validation
poilog_D2.R, poilog_D4.R, poilog_D6.R, poilog_D8.R, poilog_D10.R and poilog_D12.R: scripts that estimate the likelihood for the model fits with 2, 4, 6, 8, 10 and 12 latent variables, respectively.
Model_comparison_TableS2: produces Table S2 in the appendix, which shows the model comparison based on the ELPD.
Stan models
MARPLN_RHS_LV_Pomacentrids.stan: Stan script for the multivariate autoregressive Poisson-Lognormal model with the regularised horseshoe prior and latent variables, fitted to the LTMP data.
MARPLN_RHS_LV_NonPomacentrids_Rrand_prior.stan: Stan script for the multivariate autoregressive Poisson-Lognormal model with the regularised horseshoe prior and latent variables. This model was fitted to the non-pomacentrid species and includes reef random effects for the intrinsic growth rates and the intraspecific density dependence. It also has tighter priors to ensure convergence.
Figures
Figure 3.R, Figure 4.R, Figure 5.R and Figure 6.R: scripts that produce the corresponding figures in the main text of the manuscript.
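A minimal R sketch for inspecting the 20-species file; the name of the object stored inside the .RData file is not documented here, so it is recovered from load():

# load() returns the names of the objects it restores into the workspace
obj_names <- load("LTMP_data_1995_2005_prop_zero_20sp.RData")
ltmp <- get(obj_names[1])
str(ltmp[, 1:5])        # Year, Reef, Reef_lat, Reef_long, Shelf
colnames(ltmp)[6:25]    # the 20 species count columns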
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides a comprehensive representation of Great Britain’s (GB) electricity transmission network. It is designed for researchers, engineers, and policymakers focusing on power system analysis, network planning, and the integration of renewable energy technologies.
The dataset is organised into multiple sheets, with each sheet representing a specific type of network device or component. The rows in each sheet correspond to individual devices, and the columns provide detailed specifications and attributes for those devices. Below is an overview of the dataset structure:
Terminals: Columns include Name, Voltage (kV), Latitude, Longitude. Details the geographical and electrical characteristics of network nodes.
Transmission Lines: Columns include Terminal i, Terminal j, Length (km), R1 (Ω), X1 (Ω), Rated Voltage (kV), Nominal Current (kA), B1 (µS), C1 (µF), B0 (µS), C0 (µF). Captures the physical and electrical properties of transmission lines.
Transformers: Columns include Name, HV/LV sides, Snom (MVA), Tap Position, r (p.u.), x (p.u.), b (p.u.), r0 (p.u.), x0 (p.u.). Provides technical specifications and operational parameters of transformers.
Static Generators: Columns include Name, Plant Category, Subcategory, Active Power (MW), Apparent Power (MVA), Maximum Power (MW), Maximum Reactive Power Limit (Mvar). Includes information about renewable and non-synchronous generation sources.
Synchronous Generators: Columns include Name, Plant Category, Subcategory, Active Power (MW), Apparent Power (MVA), Max Active Power Limit (MW), Max Reactive Power Limit (Mvar). Describes synchronous generation sources.
Demand: Columns include Terminal, Active Power (MW), Reactive Power (Mvar). Specifies electricity demand at different nodes for the winter 2024 peak.
Shunt Devices: Columns include Name, Type, Rated Voltage (kV), Rated Reactive Power (Mvar), Upper Voltage Limit (p.u.), Lower Voltage Limit (p.u.). Represents devices used for voltage regulation and reactive power compensation.
SVC (Static VAR Compensators): Columns include Terminal, Q of Reactance (>0) (Mvar), Rated Reactive Power (Mvar), Target Voltage (kV). Details reactive power management devices.
This dataset was created using validated models and publicly available data. It serves as a foundation for power system studies, including grid reliability analysis, congestion management, and renewable integration scenarios.
The research results derived from this dataset have been published in Applied Energy in the paper titled "Addressing Electricity Transmission Network Congestions Using Battery Energy Storage Systems – A Case Study of Great Britain". This publication explores the potential of battery energy storage systems to address congestion challenges in GB’s transmission network.
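A minimal R sketch for loading one sheet with the readxl package; the workbook file name is a placeholder, and the sheet names follow the listing above:

library(readxl)

path <- "GB_transmission_network.xlsx"   # placeholder file name; adjust to the downloaded workbook
excel_sheets(path)                       # list the available sheets
lines <- read_excel(path, sheet = "Transmission Lines")
head(lines)                              # one row per transmission line, columns as listed above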
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data include station ID, collection date, vessel, time, location, water depth, salinity, and temperature from cruises in the Albemarle-Pamlico Estuarine System, coastal North Carolina, at the western edge of the Gulf Stream, in 2018 and 2019.
These samples were collected in coastal North Carolina to investigate the impacts of the 2018 hurricane season.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
These Kaggle datasets offer a comprehensive analysis of the US real estate market, leveraging data sourced from Redfin via an unofficial API. They contain weekly snapshots stored in CSV files, reflecting the dynamic nature of property listings, prices, and market trends across various states and cities, except for Wyoming, Montana, and North Dakota, and with specific data generation for Texas cities. Notably, the collection includes a prepared version, USA_clean_unique, which has undergone the initial cleaning steps outlined in the thesis. These datasets were part of my thesis; the other two countries covered were France and the UK.
These steps include:
- Removal of irrelevant features for statistical analysis.
- Renaming variables for consistency across international datasets.
- Adjustment of variable value ranges for a more refined analysis.
Unique aspects such as Redfin’s “hot” label algorithm, property search status, and detailed categorizations of property types (e.g., single-family residences, condominiums/co-ops, multi-family homes, townhouses) provide deep insights into the market. Additionally, external factors like interest rates, stock market volatility, unemployment rates, and crime rates have been integrated to enrich the dataset and offer a multifaceted view of the real estate market's drivers.
The USA_clean_unique dataset represents a key step before data normalization/trimming, containing variables both in their raw form and categorized based on predefined criteria, such as property size, year of construction, and number of bathrooms/bedrooms. This structured approach aims to capture the non-linear relationships between various features and property prices, enhancing the dataset's utility for predictive modeling and market analysis.
See columns from USA_clean_unique.csv and my Thesis (Table 2.8) for exact column descriptions.
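A minimal R sketch for a first look at the prepared file; the path is an assumption:

usa <- read.csv("USA_clean_unique.csv", stringsAsFactors = FALSE)   # path is an assumption
dim(usa)     # number of listings and columns
names(usa)   # column names to compare against Table 2.8 in the thesis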
Table 2.4 and Section 2.2.3, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.
If you want to continue generating datasets yourself, see my Github Repository for code inspiration.
Let me know if you want to see how I got from the raw data to USA_clean_unique.csv. The steps include cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming columns for consistency.
This dataset reports CTD and water column profiles near the GC-600 natural oil seep in the Gulf of Mexico, collected aboard R/V Point Sur cruise PS19_14 from 2019-01-26 to 2019-01-28. Water column profiles were collected at two sampling sites (GC-699 and GC-600), and the data include CTD, dissolved oxygen, Chlorophyll-a, Dissolved Organic Matter (DOM) fluorescence (fDOM), altimeter, Photosynthetically Available Radiation (PAR), and Surficial PAR (SPAR). During this cruise, we performed a series of CTD casts in and around both sites to constrain water column structure and radium isotopes. The main objective of the cruise was to use the ROV Odysseus from Pelagic Research Services to directly sample hydrocarbons emanating from MegaPlume at GC-600 (27° 22.199'N, 90° 34.262'W). During our time at sea, we further aimed to sample radium isotopes in the water column profiles. The R/V Point Sur cruise PS19_14 was led by chief scientist Dr. Richard Peterson.
Version 5 release notes:
Removes support for SPSS and Excel data.Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
Adds in agencies that report 0 months of the year.Adds a column that indicates the number of months reported. This is generated summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime. They may not necessarily report every crime every month. Agencies that did not report a crime with have a value of NA for every arrest column for that crime.Removes data on runaways.
Version 4 release notes:
Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these column includes the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
Add data for 2016.Order rows by year (descending) and ORI.Version 2 release notes:
Fix bug where Philadelphia Police Department had incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here. https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possible incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, If you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrests for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.
To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.
I created 9 arrest categories myself. The categories are:
Total Male Juvenile, Total Female Juvenile, Total Male Adult, Total Female Adult, Total Male, Total Female, Total Juvenile, Total Adult, and Total Arrests. All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses in it.
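As a rough illustration of how such totals are built from the sex-age columns, here is a minimal R sketch; the column names below are hypothetical placeholders, not the abbreviated names used in the actual files:

# Hypothetical sex-age arrest columns for one offense
asr <- data.frame(
  murder_male_under10   = c(0, 1),
  murder_male_10_12     = c(2, 0),
  murder_female_under10 = c(0, 0),
  murder_female_10_12   = c(1, 0)
)
sex_age_cols <- grep("_(male|female)_", names(asr), value = TRUE)
asr$murder_tot_male    <- rowSums(asr[, grep("_male_", sex_age_cols, value = TRUE)])
asr$murder_tot_female  <- rowSums(asr[, grep("_female_", sex_age_cols, value = TRUE)])
asr$murder_tot_arrests <- rowSums(asr[, sex_age_cols])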
As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight that contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each contain crimes belonging to a major crime category and do not overlap in crimes other than the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Because Stata limits column names to a maximum of 32 characters, I have abbreviated the crime names in the data. The files and their included crimes are:
Index Crimes: Murder, Rape, Robbery, Aggravated Assault, Burglary, Theft, Motor Vehicle Theft, Arson
Alcohol Crimes: DUI, Drunkenness, Liquor
Drug Crimes: Total Drug, Total Drug Sales, Total Drug Possession, Cannabis Possession, Cannabis Sales, Heroin or Cocaine Possession, Heroin or Cocaine Sales, Other Drug Possession, Other Drug Sales, Synthetic Narcotic Possession, Synthetic Narcotic Sales
Grey Collar and Property Crimes: Forgery, Fraud, Stolen Property
Financial Crimes: Embezzlement, Total Gambling, Other Gambling, Bookmaking, Numbers Lottery
Sex or Family Crimes: Offenses Against the Family and Children, Other Sex Offenses, Prostitution, Rape
Violent Crimes: Aggravated Assault, Murder, Negligent Manslaughter, Robbery, Weapon Offenses
Other Crimes: Curfew, Disorderly Conduct, Other Non-traffic, Suspicion, Vandalism, Vagrancy
Simple: This data set has every crime but only the nine arrest categories that I created (see above).
If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.