License: MIT (https://opensource.org/licenses/MIT)
These Kaggle datasets provide real estate listings downloaded from the French real estate market, capturing data from a leading platform in France (Seloger), mirroring the approach taken for the US dataset (Redfin) and the UK dataset (Zoopla). They encompass detailed property listings, pricing, and market trends across France, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named France_clean_unique.csv.
The cleaning process mirrored that of the US dataset: removing irrelevant features, normalizing variable names for consistency with the US and UK datasets, and adjusting variable value ranges to remove extreme outliers. To augment the dataset's depth, external factors such as inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on France's real estate market drivers.
For exact column descriptions, see columns for France_clean_unique.csv and my thesis.
Table 2.5 and Section 2.2.1, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.
If you want to continue generating datasets yourself, see my GitHub repository for code inspiration.
Let me know if you want to see how I got from the raw data to France_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming some columns.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Significant shifts in latitudinal optima of North American birds (PNAS)
Paulo Mateus Martins, Marti J. Anderson, Winston L. Sweatman, and Andrew J. Punnett
Overview
This file contains the raw 2022 release of the North American breeding bird survey dataset (Ziolkowski Jr et al. 2022), as well as the filtered version used in our paper and the code that generated it. We also included code for using BirdLife's species distribution shapefiles to classify species as eastern or western based on their occurrence in the BBS dataset and to calculate the percentage of their range covered by the BBS sampling extent. Note that this code requires species distribution shapefiles, which are not provided but can be obtained directly from https://datazone.birdlife.org/species/requestdis.
Reference
D. J. Ziolkowski Jr., M. Lutmerding, V. I. Aponte, M. A. R. Hudson, North American breeding bird survey dataset 1966–2021: U.S. Geological Survey data release (2022), https://doi.org/10.5066/P97WAZE5
Detailed file description
info_birds_names_shp: A data frame that links BBS species names (column Species) to shapefiles (column Species_BL). See the code2_sampling coverage script.
dat_raw_BBS_data_v2022: This R environment contains the raw BBS data from the 2022 release (https://www.sciencebase.gov/catalog/item/625f151ed34e85fa62b7f926). This object contains data frames created with the files "Routes.zip" (route information), "SpeciesList.txt" (bird taxonomy), and "50-StopData.zip" (actual counts per route and year). This object is the starting point for creating the dataset used in the paper, which was filtered to remove taxonomic uncertainties, as demonstrated in the "code1_build_long_wide_datasets" R script.
code1_build_long_wide_datasets: This code filters the original dataset (dat_raw_BBS_data_v2022) to remove taxonomic uncertainties, assigns routes as either eastern or western based on regionalization using the dynamically constrained agglomerative clustering and partitioning method (see the Methods section of the paper), and generates the full long and wide versions of the dataset used in the analyses (dat2_filtered_data_long, dat3_filtered_data_wide).
dat2_filtered_data_long: The filtered raw dataset in long form. This dataset was further filtered to remove nocturnal and aquatic species, as well as species with fewer than 30 occurrences, but the complete version is available here. To obtain the exact subset used in the analysis, filter this dataset using the column Species from datasets S1 or S3.
dat3_filtered_data_wide: The filtered raw dataset in wide form. This dataset was further filtered to remove nocturnal and aquatic species, as well as species with fewer than 30 occurrences, but the complete version is available here. To obtain the exact subset used in the analysis, filter this dataset using the column Species from datasets S1 or S3.
code2_sampling coverage: This code determines how much of a bird distribution is covered by the BBS sampling extent (refer to Dataset S1). It is important to note that this script requires bird species distribution shapefiles from BirdLife International, which we are not permitted to share. The shapefiles can be requested directly at https://datazone.birdlife.org/species/requestdis
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.
Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.
Potential methods for addressing limitations with this dataset:
- Identifying and removing unrealistic temperature estimates: calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels) and filter percent_cloud_pixels by a desired percentage of cloud coverage; remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10); and filter to waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles: these waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.
Files included in this data release:
- "year_byscene=XXXX.zip" – includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the byscene datasets may only include one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset is: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" – includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land and no output data were generated. An example file path for this dataset is: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" – this script includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualizing, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" – this crosswalk file identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" – this crosswalk file identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
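As a quick illustration of the suggested filtering, the R sketch below (not part of the data release) opens the nested .parquet files with the arrow package and applies the cloud and water-pixel screens described above; the local path, the 50% cloud threshold, and the assumption that the column names match the description are all illustrative choices.

```r
# Sketch only: path, threshold, and column names are assumptions, not part of the release.
library(arrow)
library(dplyr)

# Open the extracted, hive-partitioned directories as one dataset.
byscene <- open_dataset("year_byscene=2023", format = "parquet")

cleaned <- byscene %>%
  mutate(percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)) %>%
  filter(percent_cloud_pixels < 0.5,   # keep scenes with < 50% cloud over the waterbody
         wb_dswe1_pixels >= 10,        # require at least 10 water pixels
         dp_dswe == 1) %>%             # deepest point classified as water
  collect()
```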
This is a pre-release of Seshat-NLP, a dataset of labelled text segments derived from the Seshat Databank. These text segments were originally used in the Seshat Databank to justify the coding of historical "facts". A data point in the Seshat Databank describes a property of a past society at a certain time (or time range). We use these data points with their textual justifications to extract an NLP dataset of text segments accompanied by topic labels.
General Overview
The dataset is organised around unique text segments (i.e., each row is one unique segment). These segments are connected with labels that designate the historical information contained within the text. Each segment has at least one 4-tuple of labels associated with it but can have more. The labels are ("variable_name", "variable_id", "value", and "polity_id").
Below is a simplified example row in our dataset (illustrative values only):
Description: Thebes was the capital …
Labels ("variable", "var_id", "value", "polity"): [("Capital", "…", "Thebes", "Egypt Middle Kingdom"), …]
Reference: {"Title": "The Oxford Encyclopedia of …", "Author": "…", "DOI": "…", …}
Note on Source Literature Text Segments
Our dataset partially consists of segments taken from scientific literature on history, and we pair these segments with labels that denote their content. We are currently looking into the legal considerations of releasing such data. In the meantime, we have added information to our dataset that allows the identification of the pertaining documents for each description.
In Depth Explanation of the Dataset
List of files in the release:
Seshat_NLP.sql
This file is a PostgreSQL dump that can be used to instantiate the PostgreSQL table with all the data. The table zenodoexport has the following columns:
- id: row identifier
- description: textual justification of the coded value
- labels: labels for the description
- reference_information: information required to retrieve documents
- description_hash: utility column
- zodero_id: utility column
Hierarchy_graph.gexf
The hierarchy_graph.gexf file is an XML-based export of the hierarchy graph that can be used to tie variables to their hierarchical position in the Seshat codebook.
Explanation of Labels Column
The labels column contains a list of 4-tuples which, in order, denote "variable_name", "variable_id", "value", and "polity_id". We use this structure to allow a single segment/description to have multiple 4-tuples of labels; this is useful when the same description has been used to justify multiple "facts" in the original Seshat Databank. The variable_ids can be used to tie variable labels to nodes in the hierarchy of the Seshat codebook.
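Once the dump has been restored, the table can be queried from R. Below is a minimal sketch using the DBI and RPostgres packages; the database name, connection details, and the LIMIT are placeholders for illustration, not part of the release.

```r
# Sketch only: connection details are placeholders.
library(DBI)
library(RPostgres)

con <- dbConnect(Postgres(), dbname = "seshat", host = "localhost",
                 user = "postgres", password = "postgres")

# Pull a few segments with their labels and reference information.
segments <- dbGetQuery(con, "
  SELECT id, description, labels, reference_information
  FROM zenodoexport
  LIMIT 10
")

dbDisconnect(con)
```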
License: MIT (https://opensource.org/licenses/MIT)
These Kaggle datasets provide real estate listings downloaded from the UK real estate market, capturing data from a leading platform in the UK (Zoopla), mirroring the approach taken for the US dataset (Redfin) and the French dataset (Seloger). They encompass detailed property listings, pricing, and market trends across the UK, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named UK_clean_unique.csv.
The cleaning process mirrored that of the US and French datasets: removing irrelevant features, normalizing variable names for consistency with the USA and France, and adjusting variable value ranges to remove extreme outliers. To augment the dataset's depth, external factors such as inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on the UK's real estate market drivers.
For exact column descriptions, see columns for UK_clean_unique.csv and my thesis.
Table 2.6 and Section 2.2.2, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.
If you want to continue generating datasets yourself, see my GitHub repository for code inspiration.
Let me know if you want to see how I got from the raw data to UK_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming some columns.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS (European Journal of Media Studies), an open-access journal aiming to enhance data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite the data paper when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a single crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found in the advanced search, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
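For readers unfamiliar with the two methods, the short R sketch below illustrates the difference using the stringdist package; the titles and the use of stringdist here are illustrative assumptions and may not match the exact implementation in “r_2_scrape_matches”.

```r
# Illustrative comparison of the two fuzzy-matching methods named above.
library(stringdist)

core_title  <- "The Killing of a Sacred Deer"
imdb_titles <- c("The Killing of a Sacred Deer",  # exact match
                 "Killing of a Sacret Deer",      # typo plus dropped article
                 "Sacred Games")                  # unrelated title

# Cosine similarity on character 2-grams rewards overall character overlap.
stringsim(core_title, imdb_titles, method = "cosine", q = 2)

# OSA (optimal string alignment) tolerates typos and transposed characters.
stringsim(core_title, imdb_titles, method = "osa")
```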
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definitions of variables (such as location, festival name, and festival categories), units of measurement, data sources, coding, and information on missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Data and results from the Imageomics Workflow. These include data files from the Fish-AIR repository (https://fishair.org/) for purposes of reproducibility and outputs from the application-specific imageomics workflow contained in the Minnow_Segmented_Traits repository (https://github.com/hdr-bgnn/Minnow_Segmented_Traits).
Fish-AIR: This is the dataset downloaded from Fish-AIR, filtered for Cyprinidae and the Great Lakes Invasive Network (GLIN) data from the Illinois Natural History Survey (INHS) dataset. These files contain information about fish images, fish image quality, and paths for downloading the images. The data download ARK ID is dtspz368c00q (downloaded 2023-04-05). The following files are unaltered from the Fish-AIR download:
extendedImageMetadata.csv: A CSV file containing information about each image file. It has the following columns: ARKID, fileNameAsDelivered, format, createDate, metadataDate, size, width, height, license, publisher, ownerInstitutionCode. Column definitions are given at https://fishair.org/vocabulary.html, and the persistent column identifiers are in the meta.xml file.
imageQualityMetadata.csv: A CSV file containing information about the quality of each image. It has the following columns: ARKID, license, publisher, ownerInstitutionCode, createDate, metadataDate, specimenQuantity, containsScaleBar, containsLabel, accessionNumberValidity, containsBarcode, containsColorBar, nonSpecimenObjects, partsOverlapping, specimenAngle, specimenView, specimenCurved, partsMissing, allPartsVisible, partsFolded, brightness, uniformBackground, onFocus, colorIssue, quality, resourceCreationTechnique. Column definitions are given at https://fishair.org/vocabulary.html, and the persistent column identifiers are in the meta.xml file.
multimedia.csv: A CSV file containing information about image downloads. It has the following columns: ARKID, parentARKID, accessURI, createDate, modifyDate, fileNameAsDelivered, format, scientificName, genus, family, batchARKID, batchName, license, source, ownerInstitutionCode. Column definitions are given at https://fishair.org/vocabulary.html, and the persistent column identifiers are in the meta.xml file.
meta.xml: An XML file with metadata about the column indices and URIs for each file contained in the original downloaded zip file. This file is used in the fish-air.R script to extract the indices for column headers.
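As a rough illustration of that last step, the sketch below pulls field indices and term URIs out of meta.xml with the xml2 package; the <field index="..." term="..."> structure is an assumption, and the actual fish-air.R script in Minnow_Segmented_Traits may do this differently.

```r
# Sketch only: assumes Darwin Core archive-style <field index="..." term="..."/> elements.
library(xml2)

meta   <- read_xml("meta.xml")
fields <- xml_find_all(meta, ".//*[local-name() = 'field']")  # namespace-agnostic lookup

data.frame(
  index = xml_attr(fields, "index"),
  term  = xml_attr(fields, "term")
)
```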
The outputs from the Minnow_Segmented_Traits workflow are:
sampling.df.seg.csv: Table with tallies of the sampling of image data per species during the data cleaning and data analysis. This is used in Table S1 in Balk et al.
presence.absence.matrix.csv: The Presence-Absence matrix from segmentation, not cleaned. This is the result of the combined outputs from the presence.json files created by the rule “create_morphological_analysis”. The cleaned version of this matrix is shown as Table S3 in Balk et al.
heatmap.avg.blob.png and heatmap.sd.blob.png: Heatmaps of average area of biggest blob per trait (heatmap.avg.blob.png) and standard deviation of area of biggest blob per trait (heatmap.sd.blob.png). These images are also in Figure S3 of Balk et al.
minnow.filtered.from.iqm.csv: Fish image data set after filtering (see methods in Balk et al. for filter categories).
burress.minnow.sp.filtered.from.iqm.csv: Fish image data set after filtering and selecting species from Burress et al. 2017.
[NOTE - 11/24/2021: this dataset supersedes an earlier version, https://doi.org/10.15482/USDA.ADC/1518654]
Data sources. Time series data on cattle fever tick incidence, 1959-2020, and climate variables, January 1950 through December 2020, form the core information in this analysis. All variables are monthly averages or sums over the fiscal year, October 01 (of the prior calendar year, y-1) through September 30 of the current calendar year (y). Annual records on monthly new detections of Rhipicephalus microplus and R. annulatus (cattle fever tick, CFT) on premises within the Permanent Quarantine Zone (PQZ) were obtained from the Cattle Fever Tick Eradication Program (CFTEP), maintained jointly by the United States Department of Agriculture (USDA) Animal and Plant Health Inspection Service and the USDA Agricultural Research Service in Laredo, Texas. Details of tick survey procedures, CFTEP program goals and history, and the geographic extent of the PQZ are in the main text and in the Supporting Information (SI) of the associated paper. Data sources on oceanic indicators and on local meteorology, and their pretreatment, are detailed in the SI.
Data pretreatment. To address the low signal-to-noise ratio and non-independence of observations common in time series, we transformed all explanatory and response variables using a series of six consecutive steps: (i) first differences (year y minus year y-1) were calculated; (ii) these were then converted to z-scores (z = (x - μ) / σ, where x is the raw value, μ is the population mean, and σ is the standard deviation of the population); (iii) linear regression was applied to remove any directional trends; (iv) moving averages (typically 11-year point-centered moving averages) were calculated for each variable; (v) a lag was applied if/when deemed necessary; and (vi) statistics were calculated (r, n, df, P<, p<).
Principal component analysis (PCA). A matrix of z-score first differences of the 13 climate variables and CFT (1960-2020) was entered into the XLSTAT principal components analysis routine; we used Pearson correlation of the 14 x 60 matrix and Varimax rotation of the first two components.
Autoregressive Integrated Moving Average (ARIMA). An ARIMA (2,0,0) model was selected among 7 test models in which the p, d, and q terms were varied; selection was made on the basis of the lowest RMSE and AIC statistics and reduction of partial autocorrelation outcomes. A best-model linear regression of CFT values on ARIMA-predicted CFT was developed using XLSTAT linear regression software, with the objective of examining statistical properties (r, n, df, P<, p<), including the Durbin-Watson index of order-1 autocorrelation and Cook's Di distance index. Cross-validation of the model was made by withholding the last 30, and then the first 30, observations in a pair of regressions.
Forecast of the next major CFT outbreak. It is generally recognized that the onset year of the first major CFT outbreak was not 1959, but may have occurred earlier in the decade. We postulated that the actual underlying pattern is fully 44 years from the start to the end of a CFT cycle linked to external climatic drivers (SI Appendix, Hypothesis on CFT cycles). The hypothetical reconstruction was projected one full CFT cycle into the future. To substantiate the projected trend, we generated a power spectrum analysis based on 1-year values of the 1959-2020 CFT dataset using SYSTAT AutoSignal software. The outcome included a forecast to 2100; this was compared to the hypothetical reconstruction and projection. Any differences were noted, and the start and end dates of the next major CFT outbreak identified.
Resources in this dataset:
- Resource Title: CFT and climate data. File Name: climate-cft-data2.csv. Resource Description: Main dataset; see the data dictionary for information on each column.
- Resource Title: Data dictionary (metadata). File Name: climate-cft-metadata2.csv. Resource Description: Information on variables and their origin.
- Resource Title: Fitted models. File Name: climate-cft-models2.xlsx. Resource Software Recommended: Microsoft Excel (https://www.microsoft.com/en-us/microsoft-365/excel); XLSTAT (https://www.xlstat.com/en/); SYSTAT AutoSignal (https://www.systat.com/products/AutoSignal/).
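The six-step pretreatment can be sketched in R as follows; the placeholder series x, the 11-year window, and the 1-year lag are illustrative assumptions, not the authors' exact implementation (which used XLSTAT and SYSTAT).

```r
# Minimal sketch of the six pretreatment steps described above.
library(zoo)

x    <- rnorm(62)                                  # placeholder: one variable, 62 years
d1   <- diff(x)                                    # (i) first differences (y minus y-1)
z    <- (d1 - mean(d1)) / sd(d1)                   # (ii) z-scores
dtr  <- residuals(lm(z ~ seq_along(z)))            # (iii) remove linear trend
ma   <- rollmean(dtr, k = 11, align = "center")    # (iv) 11-yr point-centered moving average
lag1 <- c(NA, head(ma, -1))                        # (v) optional 1-yr lag
# (vi) statistics, e.g. cor.test(lag1[-1], response_ma[-1]) for a similarly treated response
```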
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Regional- and continental-scale models predicting variations in the magnitude and timing of streamflow are important tools for forecasting water availability as well as flood inundation extent and associated damages. Such models must define the geometry of stream channels through which flow is routed. These channel parameters, such as width, depth, and hydraulic resistance, exhibit substantial variability in natural systems. While hydraulic geometry relationships have been extensively studied in the United States, they remain unquantified for thousands of stream reaches across the country. Consequently, large-scale hydraulic models frequently take simplistic approaches to channel geometry parameterization. Over-simplification of channel geometries directly impacts the accuracy of streamflow estimates, with knock-on effects for water resource and hazard prediction.
Here, we present a hydraulic geometry dataset derived from long-term measurements at U.S. Geological Survey (USGS) stream gages across the conterminous United States (CONUS). This dataset includes (a) at-a-station hydraulic geometry parameters following the methods of Leopold and Maddock (1953), (b) at-a-station Manning's n calculated from the Manning equation, (c) daily discharge percentiles, and (d) downstream hydraulic geometry regionalization parameters based on HUC4 (Hydrologic Unit Code 4). This dataset is referenced in Heldmyer et al. (2022); further details and implications for CONUS-scale hydrologic modeling are available in that article (https://doi.org/10.5194/hess-26-6121-2022).
At-a-station Hydraulic Geometry
We calculated hydraulic geometry parameters using historical USGS field measurements at individual station locations. Leopold and Maddock (1953) derived the following power law relationships:
(w = aQ^b)
(d = cQ^f)
(v = kQ^m)
where Q is discharge, w is width, d is depth, v is velocity, and a, b, c, f, k, and m are at-a-station hydraulic geometry (AHG) parameters. We downloaded the complete record of USGS field measurements from the USGS NWIS portal (https://waterdata.usgs.gov/nwis/measurements). This raw dataset includes 4,051,682 individual measurements from a total of 66,841 stream gages within CONUS. Quantities of interest in AHG derivations are Q, w, d, and v. USGS field measurements do not include d--we therefore calculated d using d=A/w, where A is measured channel area. We applied the following quality control (QC) procedures in order to ensure the robustness of AHG parameters derived from the field data:
We considered only measurements which reported Q, v, w and A.
For each gage, we excluded measurements older than the most recent five years, so as to minimize the effects of long-term channel evolution on observed hydraulic geometry relationships.
We excluded gages for which measured Q disagreed with the product of measured velocity and measured area by more than 5%. Gages for which (Q \neq vA) are often tidally influenced and therefore may not conform to expected channel geometry relationships.
Q, v, w, and d from field measurements at each gage were log-transformed. We performed robust linear regressions on the relationships between log(Q) and log(w), log(v), and log(d). AHG parameters were derived from the regressed explanatory variables.
We applied an iterative outlier detection procedure to the linear regression residuals. Values of log-transformed w, v, and d residuals falling outside a three median absolute deviation (MAD) envelope were excluded. Regression coefficients were recalculated and the outlier detection procedure was reapplied until no new outliers were detected.
Gages for which one or more regression had p-values >0.05 were excluded, as the relationships between log-transformed Q and w, v, or d lacked statistical significance.
Gages were omitted if regressed AHG parameters did not fulfill two additional relationships derived by Leopold and Maddock: (b + f + m = 1 \pm 0.1) and (a \times c \times k = 1 \pm 0.1).
If the number of field measurements for a given gage was less than 10, either initially or after individual measurements were removed via steps 1-4, the gage was excluded from further analysis.
Application of the QC procedures described above removed 55,328 stream gages, many of which were short-term campaign gages at which very few field measurements had been recorded. We derived AHG parameters for the remaining 11,513 gages which passed our QC.
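For a single gage, the parameter derivation described above can be sketched in R as follows; the data frame field_meas and its column names are assumptions, and the published workflow additionally applies the iterative MAD-based outlier screen from step 5.

```r
# Sketch only: robust log-log fits for one gage's field measurements.
library(MASS)

meas   <- subset(field_meas, !is.na(Q) & !is.na(v) & !is.na(w) & !is.na(A))
meas$d <- meas$A / meas$w                    # depth from measured area and width

fit_w <- rlm(log(w) ~ log(Q), data = meas)   # w = a Q^b  ->  log w = log a + b log Q
fit_d <- rlm(log(d) ~ log(Q), data = meas)
fit_v <- rlm(log(v) ~ log(Q), data = meas)

cw <- coef(fit_w); cd <- coef(fit_d); cv <- coef(fit_v)
ahg <- c(a = exp(cw[[1]]), b = cw[[2]],
         c = exp(cd[[1]]), f = cd[[2]],
         k = exp(cv[[1]]), m = cv[[2]])
```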
At-a-station Manning's n
We calculated hydraulic resistance at each gage location by solving Manning's equation for Manning's n, given by
(n = R^{2/3} S^{1/2} / v)
where v is velocity, R is hydraulic radius, and S is longitudinal slope. We used smoothed reach-scale longitudinal slopes from the NHDPlusv2 (National Hydrography Dataset Plus, version 2) ElevSlope data product. We note that NHDPlusv2 contains a minimum slope constraint of 10^-5 m/m: no reach may have a slope less than this value. Furthermore, NHDPlusv2 lacks slope values for certain reaches. As such, we could not calculate Manning's n for every gage, and some Manning's n values we report may be inaccurate due to the NHDPlusv2 minimum slope constraint. We report two Manning's n values, both of which take stream depth as an approximation for R. The first takes the median stream depth and velocity measurements from the USGS's database of manual flow measurements for each gage. The second uses stream depth and velocity calculated for a 50th percentile discharge (Q50; see below). Approximating R as stream depth is an assumption which is generally considered valid if the width-to-depth ratio of the stream is greater than 10, which was the case for the vast majority of field measurements. Thus, we report two Manning's n values for each gage, which are each intended to approximately represent median flow conditions.
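As a small worked example, the function below evaluates that expression with depth standing in for R; the input values are made up.

```r
# Manning's n with hydraulic radius R approximated by stream depth.
manning_n <- function(depth_m, velocity_ms, slope) {
  (depth_m^(2 / 3) * sqrt(slope)) / velocity_ms
}

manning_n(depth_m = 1.2, velocity_ms = 0.8, slope = 5e-4)  # example values only
```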
Daily discharge percentiles
We downloaded full daily discharge records from 16,947 USGS stream gages through the NWIS online portal. The data includes records from both operational and retired gages. Records for operational gages were truncated at the end of the 2018 water year (September 30, 2018) in order to avoid use of preliminary data. To ensure the robustness of daily discharge percentiles, we applied the following QC:
For a given gage, we removed blocks of missing discharge values longer than 6 months. These long blocks of missing data generally correspond to intervals in which a gage was temporarily decommissioned for maintenance.
A gage was omitted from further analysis if its discharge record was less than 10 years (3,652 days) long and/or less than 90% complete (>10% missing values after removal of long blocks in step 1).
We calculated discharge percentiles for each of the 10,871 gages which passed QC. Discharge percentiles were calculated at increments of 1% between Q1 and Q5, increments of 5% (e.g. Q10, Q15, Q20, etc.) between Q5 and Q95, increments of 1% between Q95 and Q99, and increments of 0.1% between Q99 and Q100 in order to provide higher resolution at the lowest and highest flows, which occur much less frequently.
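The percentile grid described above can be reproduced in R roughly as follows; the daily discharge vector q is a placeholder name.

```r
# Sketch of the non-uniform percentile grid (finer resolution in the tails).
probs <- sort(unique(round(c(seq(0.01, 0.05, by = 0.01),
                             seq(0.05, 0.95, by = 0.05),
                             seq(0.95, 0.99, by = 0.01),
                             seq(0.99, 1.00, by = 0.001)), 3)))

q_percentiles <- quantile(q, probs = probs, na.rm = TRUE)
```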
HG Regionalization
We regionalized AHG parameters from gage locations to all stream reaches in the conterminous United States. This downstream hydraulic geometry regionalization was performed using all gages with AHG parameters in each HUC4, as opposed to traditional downstream hydraulic geometry--which involves interpolation of parameters of interest to ungaged reaches on individual streams. We performed linear regressions on log-transformed drainage area and Q at a number of flow percentiles as follows:
(\log(Q_i) = \beta_1 \log(DA) + \beta_0)
where Q_i is streamflow at percentile i, DA is drainage area, and (\beta_1) and (\beta_0) are regression parameters. We report (\beta_1), (\beta_0), and the r^2 value of the regression relationship for Q percentiles Q10, Q25, Q50, Q75, Q90, Q95, Q99, and Q99.9. Further discussion and additional analysis of HG regionalization are presented in Heldmyer et al. (2022).
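For one HUC4 and one percentile, such a fit reduces to a simple log-log regression; the sketch below assumes a data frame gages_in_huc4 with columns Q50 and DA.

```r
# Sketch only: one regionalization regression at the 50th percentile.
fit <- lm(log(Q50) ~ log(DA), data = gages_in_huc4)

beta1 <- coef(fit)[["log(DA)"]]
beta0 <- coef(fit)[["(Intercept)"]]
r2    <- summary(fit)$r.squared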
Dataset description
We present the HyG dataset in a comma-separated value (csv) format. Each row corresponds to a different USGS stream gage. Information in the dataset includes gage ID (column 1), gage location in latitude and longitude (columns 2-3), gage drainage area (from USGS; column 4), longitudinal slope of the gage's stream reach (from NHDPlusv2; column 5), AHG parameters derived from field measurements (columns 6-11), Manning's n calculated from median measured flow conditions (column 12), Manning's n calculated from Q50 (column 13), Q percentiles (columns 14-51), HG regionalization parameters and r2 values (columns 52-75), and geospatial information for the HUC4 in which the gage is located (from USGS; columns 76-87). Users are advised to exercise caution when opening the dataset. Certain software, including Microsoft Excel and Python, may drop the leading zeros in USGS gage IDs and HUC4 IDs if these columns are not explicitly imported as strings.
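Following that advice, one way to read the file in R without losing leading zeros is sketched below; the file name and the ID column names are assumptions to adapt to the actual release.

```r
# Read everything as text first, then convert the non-ID columns to numeric.
hyg <- read.csv("HyG.csv", colClasses = "character")

id_cols  <- c("gage_id", "huc4")               # assumed names of the ID columns
num_cols <- setdiff(names(hyg), id_cols)
hyg[num_cols] <- lapply(hyg[num_cols], as.numeric)
```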
Errata
In version 1, drainage area was mistakenly reported in cubic meters but labeled in cubic kilometers. This error has been corrected in version 2.
Version 5 release notes:
- Removes support for SPSS and Excel data.
- Changes the crimes that are stored in each file. There are more files now, with fewer crimes per file. The files and their included crimes have been updated below.
- Adds in agencies that report 0 months of the year.
- Adds a column that indicates the number of months reported. This is generated by summing the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; they may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
- Removes data on runaways.
Version 4 release notes:
- Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
- Adds data for 2016.
- Orders rows by year (descending) and ORI.
Version 2 release notes:
- Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions, please contact me at jkkaplan6@gmail.com.
I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value other than "None/not reported" when an agency reports zero arrests; in other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
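In R, that recoding amounts to something like the sketch below; the data frame ucr and the column arrests_murder are made-up names for illustration.

```r
# Illustrative recode of one arrest column following the rules described above.
bad_values <- c(10000, 20000, 30000, 40000, 50000, 60000,
                70000, 80000, 90000, 100000, 99999, 99998)

ucr$arrests_murder[ucr$arrests_murder == "None/not reported"] <- 0
ucr$arrests_murder <- as.numeric(ucr$arrests_murder)
ucr$arrests_murder[ucr$arrests_murder %in% bad_values] <- NA
```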
To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units, such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.
To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.
I created 9 arrest categories myself. The categories are:
Total Male Juvenile, Total Female Juvenile, Total Male Adult, Total Female Adult, Total Male, Total Female, Total Juvenile, Total Adult, and Total Arrests.
All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than using the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.
As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight which contain different crimes, and the "simple" file. Each file contains the data for all years. The eight categories each contain crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to 32 characters maximum, I have abbreviated the crime names in the data. The files and their included crimes are:
Index Crimes: Murder, Rape, Robbery, Aggravated Assault, Burglary, Theft, Motor Vehicle Theft, Arson
Alcohol Crimes: DUI, Drunkenness, Liquor
Drug Crimes: Total Drug, Total Drug Sales, Total Drug Possession, Cannabis Possession, Cannabis Sales, Heroin or Cocaine Possession, Heroin or Cocaine Sales, Other Drug Possession, Other Drug Sales, Synthetic Narcotic Possession, Synthetic Narcotic Sales
Grey Collar and Property Crimes: Forgery, Fraud, Stolen Property
Financial Crimes: Embezzlement, Total Gambling, Other Gambling, Bookmaking, Numbers Lottery
Sex or Family Crimes: Offenses Against the Family and Children, Other Sex Offenses, Prostitution, Rape
Violent Crimes: Aggravated Assault, Murder, Negligent Manslaughter, Robbery, Weapon Offenses
Other Crimes: Curfew, Disorderly Conduct, Other Non-traffic, Suspicion, Vandalism, Vagrancy
Simple: This data set has every crime and only the arrest categories that I created (see above).
If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
U.S. Atlantic margin cold seeps database from multibeam water column imagery, 2011-2016: South Atlantic Bight to Georges Bank. This release includes three data sets that provide geographic positions and water depths for cold seeps identified on the U.S. Atlantic margin between ~50 and ~2650 m water depth, from the Blake Ridge on the south to the most seaward part of the southern New England margin on the north. The data were collected on cruises completed by NOAA’s R/V Okeanos Explorer (operated by the Office of Ocean Exploration and Research) or academic research vessels (e.g., R/V Atlantis, R/V Armstrong, and R/V Sikuliaq) between 2011 and 2016. The raw data are available from NCEI’s water column sonar data portal (https://www.ncei.noaa.gov/products/water-column-sonar-data). Details about locating seeps in the raw data are provided in Skarke et al. (2014) and Ruppel et al. (2024).
The first data set in this release contains 2052 raw seep identifications, with many places sampled multiple times during subsequent passes on the same cruise or during subsequent cruises. This raw data set therefore includes duplicates, but it does report the cruise, trackline, date, and time for the identified water column plume, along with a quality factor and other notes. The second data set, which is the one most useful for plotting unique seep locations, has 1139 seeps culled from the raw dataset by using the density-based clustering with noise algorithms within QGIS software to remove locations that were within 40 m of each other. The third data set applies the same QGIS cluster analysis to the second data set using a search radius of 400 m, yielding 47 seep clusters (5 to 138 seeps each) and 227 identifications outside of clusters. The resulting 274 locations represent the unique seep fields or seep sites along the margin. Details about the analyses are provided in Ruppel et al. (2024).
The initial seep identifications used in the raw data set were made by analyzing water column data (WCD) collected by multibeam sonar systems, particularly 30 kHz systems like the EM302. A small amount of the analyzed multibeam WCD came from EM70 systems.
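The release itself performed the clustering in QGIS; as a rough analogue, the R sketch below applies DBSCAN from the dbscan package to projected seep coordinates, assuming a data frame seeps with x/y columns in metres.

```r
# Analogue of the 40 m clustering step (QGIS was used for the actual release).
library(dbscan)

cl <- dbscan(seeps[, c("x", "y")], eps = 40, minPts = 2)  # 40 m search radius
seeps$cluster <- cl$cluster                               # 0 = noise / unclustered
```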
This dataset contains all the data and code needed to reproduce the analyses in the manuscript: Penn, H. J., & Read, Q. D. (2023). Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage. Pest Management Science. https://doi.org/10.1002/ps.7843
Included are two .Rmd notebooks containing all code required to reproduce the analyses in the manuscript, two .html files of rendered notebook output, three .csv data files that are loaded and analyzed, and a .zip file of intermediate R objects that are generated during the model fitting and variable selection process.
Notebook files
01_boring_analysis.Rmd: This RMarkdown notebook contains R code to read and process the raw data, create exploratory data visualizations and tables, fit a Bayesian generalized linear mixed model, extract output from the statistical model, and create graphs and tables summarizing the model output, including marginal means for different varieties and contrasts between crop years.
02_trait_covariate_analysis.Rmd: This RMarkdown notebook contains R code to read raw variety-level trait data, perform feature selection based on correlations between traits, fit another generalized linear mixed model using traits as predictors, and create graphs and tables from that model output, including marginal means by categorical trait and marginal trends by continuous trait.
HTML files
These HTML files contain the rendered output of the two RMarkdown notebooks. They were generated by Quentin Read on 2023-08-30 and 2023-08-15.
01_boring_analysis.html
02_trait_covariate_analysis.html
CSV data files
These files contain the raw data. To recreate the notebook output, the CSV files should be at the file path project/data/ relative to where the notebook is run. Columns are described below.
BoredInternodes_26April2022_no format.csv: primary data file with sugarcane borer (SCB) damage
- Columns A-C are the year, date, and location. All location values are the same.
- Column D identifies which experiment the data point was collected from.
- Column E, Stubble, indicates the crop year (plant cane or first stubble).
- Column F indicates the variety.
- Column G indicates the plot (integer ID).
- Column H indicates the stalk within each plot (integer ID).
- Column I, # Internodes, indicates how many internodes were on the stalk.
- Columns J-AM are numbered 1-30 and indicate whether SCB damage was observed on that internode (0 if no, 1 if yes, blank cell if that internode was not present on the stalk).
- Column AN indicates the experimental treatment for those rows that are part of a manipulative experiment.
- Column AO contains notes.
variety_lookup.csv: summary information for the 16 varieties analyzed in this study
- Column A is the variety name.
- Column B is the total number of stalks assessed for SCB damage for that variety across all years.
- Column C is the number of years that variety is present in the data.
- Column D, Stubble, indicates which crop years were sampled for that variety ("PC" if only plant cane, "PC, 1S" if there are data for both plant cane and first stubble crop years).
- Column E, SCB resistance, is a categorical designation with four values: susceptible, moderately susceptible, moderately resistant, resistant.
- Column F is the literature reference for the SCB resistance value.
Select_variety_traits_12Dec2022.csv: variety-level traits for the 16 varieties analyzed in this study
- Column A is the variety name.
- Column B is the SCB resistance designation as an integer.
- Column C is the categorical SCB resistance designation (see above).
- Columns D-I are continuous traits from year 1 (plant cane), including sugar (Mg/ha), biomass or aboveground cane production (Mg/ha), TRS or theoretically recoverable sugar (g/kg), stalk weight of individual stalks (kg), stalk population density (stalks/ha), and fiber content of stalk (percent).
- Columns J-O are the same continuous traits from year 2 (first stubble).
- Columns P-V are categorical traits (in some cases continuous traits binned into categories): maturity timing, amount of stalk wax, amount of leaf sheath wax, amount of leaf sheath hair, tightness of leaf sheath, whether leaf sheath becomes necrotic with age, and amount of collar hair.
ZIP file of intermediate R objects
To recreate the notebook output without having to run computationally intensive steps, unzip the archive. The fitted model objects should be at the file path project/ relative to where the notebook is run.
intermediate_R_objects.zip: This file contains intermediate R objects that are generated during the model fitting and variable selection process. You may use the R objects in the .zip file if you would like to reproduce the final output, including figures and tables, without having to refit the computationally intensive statistical models.
- binom_fit_intxns_updated_only5yrs.rds: fitted brms model object for the main statistical model
- binom_fit_reduced.rds: fitted brms model object for the trait covariate analysis
- marginal_trends.RData: calculated values of the estimated marginal trends with respect to year and previous damage
- marginal_trend_trs.rds: calculated values of the estimated marginal trend with respect to TRS
- marginal_trend_fib.rds: calculated values of the estimated marginal trend with respect to fiber content
Resources in this dataset:
- Resource Title: Sugarcane borer damage data by internode, 1993-2021. File Name: BoredInternodes_26April2022_no format.csv
- Resource Title: Summary information for the 16 sugarcane varieties analyzed. File Name: variety_lookup.csv
- Resource Title: Variety-level traits for the 16 sugarcane varieties analyzed. File Name: Select_variety_traits_12Dec2022.csv
- Resource Title: RMarkdown notebook 2: trait covariate analysis. File Name: 02_trait_covariate_analysis.Rmd
- Resource Title: Rendered HTML output of notebook 2. File Name: 02_trait_covariate_analysis.html
- Resource Title: RMarkdown notebook 1: main analysis. File Name: 01_boring_analysis.Rmd
- Resource Title: Rendered HTML output of notebook 1. File Name: 01_boring_analysis.html
- Resource Title: Intermediate R objects. File Name: intermediate_R_objects.zip
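For orientation only, the sketch below shows the general shape of a binomial brms fit like the one described for 01_boring_analysis.Rmd; the formula, variable names, and settings are assumptions, and the actual fitted models are provided as .rds files in intermediate_R_objects.zip.

```r
# Hypothetical sketch of a binomial GLMM in brms (not the authors' exact model).
library(brms)

fit <- brm(
  bored_internodes | trials(n_internodes) ~ variety * crop_year + (1 | plot),
  family = binomial(),
  data   = borer_data,
  chains = 4, cores = 4
)

summary(fit)
```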
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
This dataset consists of part 2 of the data associated with the publication "Data-driven Direct Diagnosis of PV Connected Batteries".
The synthetic cycles were generated using the mechanistic modeling approach. See “Big data training data for artificial intelligence-based Li-ion diagnosis and prognosis“ (Journal of Power Sources, Volume 479, 15 December 2020, 228806) and "Analysis of Synthetic Voltage vs. Capacity Datasets for Big Data Li-ion Diagnosis and Prognosis" (Energies 2021, 14, 2371 ) for more details.
Two sets of data are available, one for training and one for validation. Training dataset: MEDB_PI folder, clear-sky irradiance, 0.025 triplet resolution up to 50% degradation with 2% increment. Validation dataset: MEDB_Cloud folder, 18 different cloudy days, 0.05 triplet resolution up to 50% degradation with 2% increment.
All datasets were generated with slightly different cell parameters to account for cell-to-cell variations. Details are available in the publication. For each duty cycle, 3 sets of files are provided: the *_V files contain V vs. Q data, the *_t files contain V vs. time data, and the *_R files contain rate vs. Q data.
For each file, each column in the volt, voltT, or rate variable corresponds to 1 degradation path, and the 1001 lines correspond to the resolution in the variable Q (for the capacity-based data) or timenorm (for the time-based data). Details of each duty cycle are provided in the pathinfo variable, with headers in pathinfo_index (1 - % LLI, 2 - % LAMPE, 3 - % LAMNE, 4 - Capacity, 5 - DOD).
All simulations were performed with the 2022 version of the alawa toolbox. Voltage and kinetics of electrodes from different manufacturers, with different composition, or with different architecture might differ significantly.
MEDB_irradiancedata.mat contains data gathered for 2 years at the MEDB site (see the publication for details). Data provided courtesy of HNEI's Severine Busquet, Jonathan Kobayashi, and Richard Rocheleau.
This MATLAB structure contains the following variables:
dn: time from Matlab reference time
dh: hour of the day
doy: day of the year
dm: month of the year
dy: year
ghi: global irradiance (W/m2)
class: clear sky yes/no
perc_clear: clear sky percentage (%)
meta: panel metadata
tot_POA: clear sky irradiance at POA (W/m2)
inc: Solar angle of incidence relative to the POA (degree)
tot_horz: clear sky horizontal irradiance (W/m2)
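As a minimal sketch for reading this file outside MATLAB, assuming the .mat file is in a version that the R.matlab package can read (a v7.3 file would instead need an HDF5 reader) and that the exact nesting of the saved structure is checked before use:

library(R.matlab)

# Read the MATLAB structure into an R list
irr <- readMat("MEDB_irradiancedata.mat")
str(irr, max.level = 2)   # inspect how the variables listed above are nested

# Hypothetical access pattern once the nesting is known, e.g. global irradiance:
# ghi <- irr$ghi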
This dataset was derived from groundwater data provided by the NSW Office of Water. You can find a link to the source dataset in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.
The difference between NSW Office of Water GW licences - GLO v4 and v5 is that the element list created for the asset database has been added to a new worksheet. Additional columns have then been added to the element list, including BoreID, share per works, volume per works, and depth. The depth field was added for QA purposes. There are 12 bores in the source data that have a depth that was not converted across to the element list. This issue has been raised with the A&R team to fix the element list. Also, where there is a boreID and no depth, the National Groundwater Information System (NGIS) has been checked to determine if there are depths available for bores. One additional depth was added from the NGIS.
The aim of this dataset was to map each groundwater works to its volumetric entitlement without double counting the volume, and to aggregate/disaggregate the data depending on the final use.
This dataset has not been clipped to the Gloucester PAE; the number of economic assets/relevant licences will therefore reduce substantially once this occurs.
Bioregional Assessment Programme (2014) NSW Office of Water GW licence extract linked to spatial locations - GLO v5 UID elements 27032014. Bioregional Assessment Derived Dataset. Viewed 18 July 2018, http://data.bioregionalassessments.gov.au/dataset/0115c2ba-73c6-4c98-b539-9f67594980cf.
Derived From NSW Office of Water Groundwater Licence Extract Gloucester - Oct 2013
Derived From NSW Office of Water Groundwater licence extract linked to spatial locations GLOv2 19022014
Derived From NSW Office of Water GW licence extract linked to spatial locations GLOv4 UID 14032014
Derived From NSW Office of Water Groundwater Entitlements Spatial Locations
Derived From National Groundwater Information System (NGIS) v1.1
Derived From NSW Office of Water GW licence extract linked to spatial locations GLOv3 12032014
Many species of rockfishes live in complex rocky habitats, have been over-fished, and are difficult or impossible to accurately survey using conventional bottom-trawl gear. Our ability to count these species in rocky habitats and to delineate the distribution and extent of these habitats is critical to the estimation of absolute abundance of these species for stock assessments. To that end, NMFS is pursuing the Untrawlable Habitat Strategic Initiative (UHSI) field research in the Southern California Bight. Associated with the goals of the UHSI, NMFS also recognizes the need for more high-resolution mapping of the seafloor in order to delineate and quantify rockfish habitats. Research planned for October 2017 on the B. Shimada represents year-2 of the UHSI project in Southern California. We are using the results from our year-1 study off the R. Lasker in October 2016 in Southern California to inform the experiments we will conduct in this second year from the B. Shimada. We also will continue our plan to map the seafloor at priority sites in and around the Channel Islands. During this mission, we will 1) rendezvous with R/V Velero IV (contracted through NMFS) and use NMFS's Seabed autonomous underwater vehicle (AUV) as part of an underwater experiment to observe and quantify the behavior of rockfishes in reaction to mobile survey vehicles; 2) acquire high-resolution bathymetric data around the northern Channel Islands using the vessel's ME70 sonar; 3) survey rockfishes and habitats visually using the AUV; 4) deploy and retrieve small drop cameras to observe fishes on the seafloor. This is a multi-year collaboration among researchers from the NMFS SWFSC, NWFSC, SEFSC, and AFSC, and complements ongoing similar surveys being conducted in the Gulf of Mexico as well as ongoing seafloor mapping and habitat surveys being conducted by NOAA's Channel Islands National Marine Sanctuary. The results of this mission will lead to more accurate estimates of demersal fish populations and associated habitats in deep water, thereby supporting NOAA's objectives to achieve sustainable fisheries and improve our understanding of marine ecosystems. Our findings will improve stock assessments of species in untrawlable habitats, and will assist in the interpretation and understanding of the use of deepwater habitats by demersal fishes.
This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at PSITAdministration@ChicagoPolice.org. Disclaimer: These crimes may be based upon preliminary information supplied to the Police Department by the reporting parties that have not been verified. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. The Chicago Police Department will not be responsible for any error or omission, or for the use of, or the results obtained from the use of this information. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited. The Chicago Police Department is not responsible for the content of any off-site pages that are referenced by or that reference this web page other than an official City of Chicago or Chicago Police Department web page. The user specifically acknowledges that the Chicago Police Department is not responsible for any defamatory, offensive, misleading, or illegal conduct of other users, links, or third parties and that the risk of injury from the foregoing rests entirely with the user. The unauthorized use of the words "Chicago Police Department," "Chicago Police," or any colorable imitation of these words or the unauthorized use of the Chicago Police Department logo is unlawful. This web page does not, in any way, authorize such use. Data are updated daily. To access a list of Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) codes, go to http://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The human body is an outstandingly complex machine including around 1000 muscles and joints acting synergistically. Yet, the coordination of the enormous number of degrees of freedom needed for movement is mastered by our one brain and spinal cord. The idea that some synergistic neural components of movement exist was already suggested at the beginning of the XX century. Since then, it has been widely accepted that the central nervous system might simplify the production of movement by avoiding the control of each muscle individually. Instead, it might be controlling muscles in common patterns that have been called muscle synergies. Only with the advent of modern computational methods and hardware has it become possible to numerically extract synergies from electromyography (EMG) signals. However, typical experimental setups do not include a large number of individuals, with common sample sizes of five to 20 participants. With this study, we make publicly available a set of EMG activities recorded during treadmill running from the right lower limb of 135 healthy young adults (78 males, 57 females). Moreover, we include in this open access data set the code used to extract synergies from EMG data using non-negative matrix factorization and the corresponding outcomes. Muscle synergies, containing the time-invariant muscle weightings (motor modules) and the time-dependent activation coefficients (motor primitives), were extracted from 13 ipsilateral EMG activities using non-negative matrix factorization. Four synergies were enough to describe as many gait cycle phases during running: weight acceptance, propulsion, early swing and late swing. We foresee many possible applications of our data, which we can summarize in three key points. First, it can be a prime source for broadening the representation of human motor control due to the large sample size. Second, it could serve as a benchmark for scientists from multiple disciplines such as musculoskeletal modelling, robotics, clinical neuroscience, sport science, etc. Third, the data set could be used both to train students and to support established scientists in the perfection of current muscle synergies extraction methods.
The "RAW_DATA.RData" R list consists of elements of S3 class "EMG", each of which is a human locomotion trial containing cycle segmentation timings and raw electromyographic (EMG) data from 13 muscles of the right-side leg. Cycle times are structured as data frames containing two columns that correspond to touchdown (first column) and lift-off (second column). Raw EMG data sets are also structured as data frames with one row for each recorded data point and 14 columns. The first column contains the incremental time in seconds. The remaining 13 columns contain the raw EMG data, named with the following muscle abbreviations: ME = gluteus medius, MA = gluteus maximus, FL = tensor fasciæ latæ, RF = rectus femoris, VM = vastus medialis, VL = vastus lateralis, ST = semitendinosus, BF = biceps femoris, TA = tibialis anterior, PL = peroneus longus, GM = gastrocnemius medialis, GL = gastrocnemius lateralis, SO = soleus.
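A minimal sketch for inspecting one trial in R; the object name RAW_DATA and the element names in the commented lines are assumptions to verify with ls() and str(), only the overall structure follows the description above:

# Load the list of locomotion trials (objects of S3 class "EMG")
load("RAW_DATA.RData")
ls()                         # confirm the name of the loaded list object

trial <- RAW_DATA[[1]]       # first trial (assumes the list is named RAW_DATA)
str(trial, max.level = 1)    # reveals the names of the cycle-times and raw-EMG data frames

# Hypothetical element names, check the str() output before running:
# head(trial$cycles)         # touchdown (col 1) and lift-off (col 2) times
# head(trial$emg)            # incremental time plus the 13 muscle columns (ME ... SO)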
The file "dataset.rar" contains the data in an older format that is not compatible with the R package musclesyneRgies.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in an R or Python environment. Please consult the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl
RFSD = load_dataset('irlspbru/RFSD')
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within a year, meaning that streaming the first n rows will not yield a random sample.
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl

RFSD = ds.dataset("local/path/to/RFSD")      # point pyarrow at the local Parquet files
print(RFSD.schema)                           # inspect the schema

RFSD_full = pl.from_arrow(RFSD.to_table())   # load the full dataset into memory
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))  # a single year

# A single year, keeping only the firm identifier (inn) and revenue (line_2110)
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(filter=ds.field('year') == 2019, columns=['inn', 'line_2110'])
)

# Apply descriptive column names from the renaming dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing in R requires the arrow package to be installed.
library(arrow)
library(data.table)

RFSD <- open_dataset("local/path/to/RFSD")   # point arrow at the local Parquet files
schema(RFSD)                                 # inspect the schema

# Load the full dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load a single year
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# A single year, keeping only the firm identifier (inn) and revenue (line_2110)
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Apply descriptive column names from the renaming dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
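For example, a minimal sketch of dropping flagged firms at load time with the R arrow interface shown above, assuming the outlier variable is stored as a logical (if it is coded 0/1, compare against 0 instead):

library(arrow)
library(data.table)

RFSD <- open_dataset("local/path/to/RFSD")
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("outlier") == FALSE)   # keep only non-flagged firms
scanner <- scan_builder$Finish()
RFSD_no_outliers <- as.data.table(scanner$ToTable())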
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities, and to study this corporate group in its entirety it is not enough to consider the financials of the parent company alone.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing previous-year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons Attribution 4.0 International License (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
The dataset contains an interpolated grid of the probability of exceeding 0.2 m drawdown at the water table in the Clarence Moreton subregion.
These formed the basis of the figures in the CLM 2.6.2 product.
All the inputs for this data set were obtained from the groundwater model data set. Spreadsheet 'CLM_MF_dmax_tmax_excprob.csv' in the source dataset has the probability of exceeding 0.2 m drawdown in column P(dmax>0.2 m). The values for computation nodes labelled 1 in the column 'layer' are interpolated to a regular grid.
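A minimal sketch of this kind of interpolation in R; the coordinate column names x and y are assumptions (check names() on the spreadsheet first), and the akima package is just one of several options for gridding scattered points:

library(data.table)
library(akima)    # provides interp() for gridding scattered points

dmax <- fread("CLM_MF_dmax_tmax_excprob.csv")
layer1 <- dmax[layer == 1]   # water-table computation nodes only

# Interpolate the exceedance probability onto a 200 x 200 regular grid
grid <- interp(x = layer1$x, y = layer1$y, z = layer1[["P(dmax>0.2 m)"]],
               xo = seq(min(layer1$x), max(layer1$x), length = 200),
               yo = seq(min(layer1$y), max(layer1$y), length = 200),
               duplicate = "mean")   # average any co-located nodes

image(grid$x, grid$y, grid$z)   # quick look at the interpolated probability surface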
Bioregional Assessment Programme (2016) CLM Groundwater model uncertainty analysis. Bioregional Assessment Derived Dataset. Viewed 10 July 2017, http://data.bioregionalassessments.gov.au/dataset/19a52dab-847a-4f6e-8904-68d062047866.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From Qld 100K mapsheets - Mount Lindsay
Derived From Qld 100K mapsheets - Helidon
Derived From Qld 100K mapsheets - Ipswich
Derived From CLM - Woogaroo Subgroup extent
Derived From Coal Bore Holes - QLD
Derived From CLM - Extent of Logan and Albert river alluvial systems
Derived From CLM - Bore allocations NSW v02
Derived From CLM - Bore allocations NSW
Derived From CLM - Bore assignments NSW and QLD summary tables
Derived From CLM - Geology NSW & Qld combined v02
Derived From CLM - Orara-Bungawalbin bedrock
Derived From CLM16gwl NSW Office of Water_GW licence extract linked to spatial locations_CLM_v3_13032014
Derived From CLM groundwater model hydraulic property data
Derived From CLM16swg Surface water gauging station data within the Clarence Moreton Basin
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From CLM - Coal Bore Holes in QLD region of Clarence-Moreton bioregion
Derived From CLM - Gatton Sandstone extent
Derived From CLM16gwl NSW Office of Water, GW licence extract linked to spatial locations in CLM v2 28022014
Derived From Bioregional Assessment areas v03
Derived From NSW Geological Survey - geological units DRAFT line work.
Derived From Mean Annual Climate Data of Australia 1981 to 2012
Derived From CLM Preliminary Assessment Extent Definition & Report( CLM PAE)
Derived From Qld 100K mapsheets - Caboolture
Derived From CLM - AWRA Calibration Gauges SubCatchments
Derived From CLM - NSW Office of Water Gauge Data for Tweed, Richmond & Clarence rivers. Extract 20140901
Derived From Qld 100k mapsheets - Murwillumbah
Derived From AHGFContractedCatchment - V2.1 - Bremer-Warrill
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From QLD Current Exploration Permits for Minerals (EPM) in Queensland 6/3/2013
Derived From Pilot points for prediction interpolation of layer 1 in CLM groundwater model
Derived From CLM - Bore water level NSW
Derived From Climate model 0.05x0.05 cells and cell centroids
Derived From CLM - New South Wales Department of Trade and Investment 3D geological model layers
Derived From CLM - Metgasco 3D geological model formation top grids
Derived From R-scripts for uncertainty analysis v01
Derived From State Transmissivity Estimates for Hydrogeology Cross-Cutting Project
Derived From CLM - Extent of Bremer river and Warrill creek alluvial systems
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
Derived From Qld 100K mapsheets - Esk
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores and NGIS v4 28072014
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From CLM - Qld Surface Geology Mapsheets
Derived From NSW Office of Water Pump Test dataset
Derived From CLM - NSW River Gauge pdf documents.
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From CLM - New South Wales well completion reports
Derived From Data for river stage interpolation in the CLM groundwater model
Derived From CLM - Extent of Lockyer Creek alluvial system
Derived From CLM - DEM in ascii format
Derived From CLM - Grafton-Rapville bedrock
Derived From CLM - Bore water level QLD
Derived From QLD Coal Seam Gas well locations - 14/08/2014
Derived From CLM - Orara-Kangaroo bedrock
Derived From Qld 100k mapsheets - Warwick
Derived From CLM - Walloon Coal Measures spatial extent
Derived From Geofabric Surface Catchments - V2.1
Derived From CLM - Stratigraphic wells in the QLD area of the Clarence-Moreton bioregion
Derived From CLM - Koukandowie FM bedrock
Derived From CLM - Queensland well completion reports
Derived From National Groundwater Information System (NGIS) v1.1
Derived From Natural Resource Management (NRM) Regions 2010
Derived From [Qld 100k mapsheets -