18 datasets found

r
R codes and dataset for Visualisation of Diachronic Constructional Change...
researchdata.edu.au
bridges.monash.edu
Updated Apr 1, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg (2019). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
Explore at:
Unique identifier
https://doi.org/10.26180/5c844c7a81768
Dataset updated
Apr 1, 2019
Dataset provided by
Monash University
Authors
Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Publication

Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

Description of R codes and data files in the repository

This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Release. So, check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of top-200 infinitival collocates for will and be going to respectively across the twenty decades of Corpus of Historical American English (from the 1810s to the 2000s).

These two input files are used in the R code file 1-script-create-input-data-raw.r. The codes preprocess and combine the two files into a long format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); it is available in the input_data_raw.txt.

Then, the script 2-script-create-motion-chart-input-data.R processes the input_data_raw.txt for normalising the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.

Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
Kickastarter Campaigns
kaggle.com
zip
Updated Jan 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alessio Cantara (2024). Kickastarter Campaigns [Dataset]. https://www.kaggle.com/datasets/alessiocantara/kickastarter-project/discussion
Explore at:
zip(2233314 bytes)Available download formats
Dataset updated
Jan 25, 2024
Authors
Alessio Cantara
Description
Welcome to my Kickstarter case study! In this project I’m trying to understand what the success’s factors for a Kickstarter campaign are, analyzing an available public dataset from Web Robots. The process of analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.

ASK

Different questions will guide my analysis: 1. Is the campaign duration influencing the success of the project? 2. Is it the chosen funding budget? 3. Which category of campaign is the most likely to be successful?

PREPARE

I’m using the Kickstarter Datasets publicly available on Web Robots. Data are scraped using a bot which collects the data in CSV format once a month and all the data are divided into CSV files. Each table contains: - backers_count : number of people that contributed to the campaign - blurb : a captivating text description of the project - category : the label categorizing the campaign (technology, art, etc) - country - created_at : day and time of campaign creation - deadline : day and time of campaign max end - goal : amount to be collected - launched_at : date and time of campaign launch - name : name of campaign - pledged : amount of money collected - state : success or failure of the campaign

Each month scraping produce a huge amount of CSVs, so for an initial analysis I decided to focus on three months: November and December 2023, and January 2024. I’ve downloaded zipped files which once unzipped contained respectively: 7 CSVs (November 2023), 8 CSVs (December 2023), 8 CSVs (January 2024). Each month was divided into a specific folder.

Having a first look at the spreadsheets, it’s clear that there is some need for cleaning and modification: for example, dates and times are shown in Unix code, there are multiple columns that are not helpful for the scope of my analysis, currencies need to be uniformed (some are US$, some GB£, etc). In general, I have all the data that I need to answer my initial questions, identify trends, and make predictions.

PROCESS

I decided to use R to clean and process the data. For each month I started setting a new working environment in its own folder. After loading the necessary libraries: R library(tidyverse) library(lubridate) library(ggplot2) library(dplyr) library(tidyr) I scripted a general R code that searches for CSVs files in the folder, open them as separate variable and into a single data frame:

csv_files <- list.files(pattern = "\\.csv$") data_frames <- list() for (file in csv_files) { variable_name <- sub("\\.csv$", "", file) assign(variable_name, read.csv(file)) data_frames[[variable_name]] <- get(variable_name) }

Next, I converted some columns in numeric values because I was running into types error when trying to merge all the CSVs into a single comprehensive file.

data_frames <- lapply(data_frames, function(df) { df$converted_pledged_amount <- as.numeric(df$converted_pledged_amount) return(df) }) data_frames <- lapply(data_frames, function(df) { df$usd_exchange_rate <- as.numeric(df$usd_exchange_rate) return(df) }) data_frames <- lapply(data_frames, function(df) { df$usd_pledged <- as.numeric(df$usd_pledged) return(df) })

In each folder I then ran a command to merge the CSVs in a single file (one for November 2023, one for December 2023 and one for January 2024):

all_nov_2023 = bind_rows(data_frames) all_dec_2023 = bind_rows(data_frames) all_jan_2024 = bind_rows(data_frames)`

After merging I converted the UNIX code datestamp into a readable datetime for the columns “created”, “launched”, “deadline” and deleted all the columns that had these data set to 0. I also filtered the values into the “slug” columns to show only the category of the campaign, without unnecessary information for the scope of my analysis. The final table was then saved.

filtered_dec_2023 <- all_dec_2023 %>% #this was modified according to the considered month select(blurb, backers_count, category, country, created_at, launched_at, deadline,currency, usd_exchange_rate, goal, pledged, state) %>% filter(created_at != 0 & deadline != 0 & launched_at != 0) %>% mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>% mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>% mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>% mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>% select(-category, -deadline, -launched_at, -created_at) %>% relocate(created, launched, setted_deadline, .before = goal) write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)

The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned" which was further modified, converting a...
H
Additional Tennessee Eastman Process Simulation Data for Anomaly Detection...
dataverse.harvard.edu
dataone.org
Updated Jul 6, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cory A. Rieth; Ben D. Amsel; Randy Tran; Maia B. Cook (2017). Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation [Dataset]. http://doi.org/10.7910/DVN/6C3JR1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/6C3JR1
Dataset updated
Jul 6, 2017
Dataset provided by
Harvard Dataverse
Authors
Cory A. Rieth; Ben D. Amsel; Randy Tran; Maia B. Cook
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6C3JR1https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6C3JR1
Description
User Agreement, Public Domain Dedication, and Disclaimer of Liability. By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms. The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission. In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights. Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law. When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work. This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website. Description This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017. Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files. Each dataframe contains 55 columns: Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions). Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping). Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively. Columns 4 to 55 contain the process variables; the column names retain the original variable names. Acknowledgments. This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.
H
Replication Data for: The Wikipedia Adventure: Field Evaluation of an...
dataverse.harvard.edu
search.dataone.org
Updated Jun 7, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sneha Narayan; Jake Orlowitz; Aaron D. Shaw; Benjamin Mako Hill (2017). Replication Data for: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users [Dataset]. http://doi.org/10.7910/DVN/6HPRIG
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/6HPRIG
Dataset updated
Jun 7, 2017
Dataset provided by
Harvard Dataverse
Authors
Sneha Narayan; Jake Orlowitz; Aaron D. Shaw; Benjamin Mako Hill
License
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6HPRIGhttps://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6HPRIG
Dataset funded by
National Science Foundation (NSF)
Description
This dataset contains the data and code necessary to replicate work in the following paper: Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron Shaw. 2017. “The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users.” in Proceedings of the 20th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '17). New York, New York: ACM Press. http://dx.doi.org/10.1145/2998181.2998307 The published paper contains two studies. Study 1 is a descriptive analysis of a survey of Wikipedia editors who played a gamified tutorial. Study 2 is a field experiment that evaluated the same the tutorial. These data are the data used in the field experiment described in Study 2. Description of Files This dataset contains the following files beyond this README: twa.RData — An RData file that includes all variables used in Study 2. twa_analysis.R — A GNU R script that includes all the code used to generate the tables and plots related to Study 2 in the paper. The RData file contains one variable (d) which is an R dataframe (i.e., table) that includes the following columns: userid (integer): The unique numerical ID representing each user on in our sample. These are 8-digit integers and describe public accounts on Wikipedia. sample.date (date string): The day the user was recruited to the study. Dates are formatted in “YYYY-MM-DD” format. In the case of invitees, it is the date their invitation was sent. For users in the control group, these is the date that they would have been invited to the study. edits.all (integer): The total number of edits made by the user on Wikipedia in the 180 days after they joined the study. Edits to user's user pages, user talk pages and subpages are ignored. edits.ns0 (integer): The total number of edits made by user to article pages on Wikipedia in the 180 days after they joined the study. edits.talk (integer): The total number of edits made by user to talk pages on Wikipedia in the 180 days after they joined the study. Edits to a user's user page, user talk page and subpages are ignored. treat (logical): TRUE if the user was invited, FALSE if the user was in control group. play (logical): TRUE if the user played the game. FALSE if the user did not. All users in control are listed as FALSE because any user who had not been invited to the game but played was removed. twa.level (integer): Takes a value 0 of if the user has not played the game. Ranges from 1 to 7 for those who did, indicating the highest level they reached in the game. quality.score (float). This is the average word persistence (over a 6 revision window) over all edits made by this userid. Our measure of word persistence (persistent word revision per word) is a measure of edit quality developed by Halfaker et al. that tracks how long words in an edit persist after subsequent revisions are made to the wiki-page. For more information on how word persistence is calculated, see the following paper: Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John Riedl. 2009. “A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia.” In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (OpenSym '09), 1–10. New York, New York: ACM Press. doi:10.1145/1641309.1641332. Or this page: https://meta.wikimedia.org/wiki/Research:Content_persistence How we created twa.RData The files twa.RData combines datasets drawn from three places: A dataset created by Wikimedia Foundation staff that tracked the details of the experiment and how far people got in the game. The variables userid, sample.date, treat, play, and twa.level were all generated in a dataset created by WMF staff when The Wikipedia Adventure was deployed. All users in the sample created their accounts within 2 days before the date they were entered into the study. None of them had received a Teahouse invitation, a Level 4 user warning, or been blocked from editing at the time that they entered the study. Additionally, all users made at least one edit after the day they were invited. Users were sorted randomly into treatment and control groups, based on which they either received or did not receive an invite to play The Wikipedia Adventure. Edit and text persistence data drawn from public XML dumps created on May 21st, 2015. We used publicly available XML dumps to generate the outcome variables, namely edits.all, edits.ns0, edits.talk and quality.score. We first extracted all edits made by users in our sample during the six month period since they joined the study, excluding edits made to user pages or user talk pages using. We parsed the XML dumps using the Python based wikiq and MediaWikiUtilities software online at: http://projects.mako.cc/source/?p=mediawiki_dump_tools https://github.com/mediawiki-utilities/python-mediawiki-utilities We obtained the XML dumps from: https://dumps.wikimedia.org/enwiki/ A list of edits made by users in our study that were subsequently deleted, created on...
Z
Data from: Russian Financial Statements Database: A firm-level collection of...
data.niaid.nih.gov
Updated Mar 14, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
Explore at:
Dataset updated
Mar 14, 2025
Dataset provided by
European University at St. Petersburg
European University at St Petersburg
Authors
Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

🔓 First open data set with information on every active firm in Russia.

🗂️ First open financial statements data set that includes non-filing firms.

🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

📅 Covers 2011-2023 initially, will be continuously updated.

🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.

The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

Importing The Data

You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

Python

🤗 Hugging Face Datasets

It is as easy as:

from datasets import load_dataset import polars as pl

This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder

RFSD = load_dataset('irlspbru/RFSD')

Alternatively, this will download ~540MB with all financial statements for 2023# to a Polars DataFrame (requires about 8GB of RAM)

RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.

Local File Import

Importing in Python requires pyarrow package installed.

import pyarrow.dataset as ds import polars as pl

Read RFSD metadata from local file

RFSD = ds.dataset("local/path/to/RFSD")

Use RFSD_dataset.schema to glimpse the data structure and columns' classes

print(RFSD.schema)

Load full dataset into memory

RFSD_full = pl.from_arrow(RFSD.to_table())

Load only 2019 data into memory

RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

Load only revenue for firms in 2019, identified by taxpayer id

RFSD_2019_revenue = pl.from_arrow( RFSD.to_table( filter=ds.field('year') == 2019, columns=['inn', 'line_2110'] ) )

Give suggested descriptive names to variables

renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv') RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

R

Local File Import

Importing in R requires arrow package installed.

library(arrow) library(data.table)

Read RFSD metadata from local file

RFSD <- open_dataset("local/path/to/RFSD")

Use schema() to glimpse into the data structure and column classes

schema(RFSD)

Load full dataset into memory

scanner <- Scanner$create(RFSD) RFSD_full <- as.data.table(scanner$ToTable())

Load only 2019 data into memory

scan_builder <- RFSD$NewScan() scan_builder$Filter(Expression$field_ref("year") == 2019) scanner <- scan_builder$Finish() RFSD_2019 <- as.data.table(scanner$ToTable())

Load only revenue for firms in 2019, identified by taxpayer id

scan_builder <- RFSD$NewScan() scan_builder$Filter(Expression$field_ref("year") == 2019) scan_builder$Project(cols = c("inn", "line_2110")) scanner <- scan_builder$Finish() RFSD_2019_revenue <- as.data.table(scanner$ToTable())

Give suggested descriptive names to variables

renaming_dt <- fread("local/path/to/descriptive_names_dict.csv") setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

Use Cases

🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

FAQ

Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?hat is the data period?

To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

What is the data period?

We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

Why are there no data for firm X in year Y?

Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek — in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and therefore are able source this information elsewhere.

Firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in this year. We remove those filings.

Why is the geolocation of firm X incorrect?

We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

Why is the data for firm X different from https://bo.nalog.ru/?

Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.

Why is the data for firm X unrealistic?

We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

Why is the data for groups of companies different from their IFRS statements?

We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

Why is the data not in CSV?

The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

Version and Update Policy

Version (SemVer): 1.0.0.

We intend to update the RFSD annualy as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous year statements is April, 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between minimization of data completeness and version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of RFSD available by July.

Licence

Creative Commons License Attribution 4.0 International (CC BY 4.0).

Copyright © the respective contributors.

Citation

Please cite as:

@unpublished{bondarkov2025rfsd, title={{R}ussian {F}inancial {S}tatements {D}atabase}, author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy}, note={arXiv preprint arXiv:2501.05841}, doi={https://doi.org/10.48550/arXiv.2501.05841}, year={2025}}

Acknowledgments and Contacts

Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
Tennessee Eastman Process Simulation Dataset
kaggle.com
zip
Updated Feb 9, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sergei Averkiev (2020). Tennessee Eastman Process Simulation Dataset [Dataset]. https://www.kaggle.com/averkij/tennessee-eastman-process-simulation-dataset
Explore at:
zip(1370814903 bytes)Available download formats
Dataset updated
Feb 9, 2020
Authors
Sergei Averkiev
Description
Intro

This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

Content

Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

Each dataframe contains 55 columns:

Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

Columns 4 to 55 contain the process variables; the column names retain the original variable names.

Acknowledgements

This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

User Agreement

By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.

Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.

When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.

This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.
FacialRecognition
kaggle.com
zip
Updated Dec 1, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
Explore at:
zip(121674455 bytes)Available download formats
Dataset updated
Dec 1, 2016
Authors
TheNicelander
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description

#https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r #################################

###Variables for downloaded files data.dir <- ' ' train.file <- paste0(data.dir, 'training.csv') test.file <- paste0(data.dir, 'test.csv') #################################

###Load csv -- creates a data.frame matrix where each column can have a different type. d.train <- read.csv(train.file, stringsAsFactors = F) d.test <- read.csv(test.file, stringsAsFactors = F)

###In training.csv, we have 7049 rows, each one with 31 columns. ###The first 30 columns are keypoint locations, which R correctly identified as numbers. ###The last one is a string representation of the image, identified as a string.

###To look at samples of the data, uncomment this line:

head(d.train)

###Let's save the first column as another variable, and remove it from d.train: ###d.train is our dataframe, and we want the column called Image. ###Assigning NULL to a column removes it from the dataframe

im.train <- d.train$Image d.train$Image <- NULL #removes 'image' from the dataframe

im.test <- d.test$Image d.test$Image <- NULL #removes 'image' from the dataframe

################################# #The image is represented as a series of numbers, stored as a string #Convert these strings to integers by splitting them and converting the result to integer

#strsplit splits the string #unlist simplifies its output to a vector of strings #as.integer converts it to a vector of integers. as.integer(unlist(strsplit(im.train[1], " "))) as.integer(unlist(strsplit(im.test[1], " ")))

###Install and activate appropriate libraries ###The tutorial is meant for Linux and OSx, where they use a different library, so: ###Replace all instances of %dopar% with %do%.

install.packages('foreach')

library("foreach", lib.loc="~/R/win-library/3.3")

###implement parallelization im.train <- foreach(im = im.train, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } im.test <- foreach(im = im.test, .combine=rbind) %do% { as.integer(unlist(strsplit(im, " "))) } #The foreach loop will evaluate the inner command for each row in im.train, and combine the results with rbind (combine by rows). #%do% instructs R to do all evaluations in parallel. #im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):

###Save all four variables in data.Rd file ###Can reload them at anytime with load('data.Rd')

save(d.train, im.train, d.test, im.test, file='data.Rd')

load('data.Rd')

#each image is a vector of 96*96 pixels (96*96 = 9216). #convert these 9216 integers into a 96x96 matrix: im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)

#im.train[1,] returns the first row of im.train, which corresponds to the first training image. #rev reverse the resulting vector to match the interpretation of R's image function #(which expects the origin to be in the lower left corner).

#To visualize the image we use R's image function: image(1:96, 1:96, im, col=gray((0:255)/255))

#Let’s color the coordinates for the eyes and nose points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red") points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue") points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")

#Another good check is to see how variable is our data. #For example, where are the centers of each nose in the 7049 images? (this takes a while to run): for(i in 1:nrow(d.train)) { points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red") }

#there are quite a few outliers -- they could be labeling errors. Looking at one extreme example we get this: #In this case there's no labeling error, but this shows that not all faces are centralized idx <- which.max(d.train$nose_tip_x) im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96) image(1:96, 1:96, im, col=gray((0:255)/255)) points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")

#One of the simplest things to try is to compute the mean of the coordinates of each keypoint in the training set and use that as a prediction for all images colMeans(d.train, na.rm=T)

#To build a submission file we need to apply these computed coordinates to the test instances: p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T) colnames(p) <- names(d.train) predictions <- data.frame(ImageId = 1:nrow(d.test), p) head(predictions)

#The expected submission format has one one keypoint per row, but we can easily get that with the help of the reshape2 library:

install.packages('reshape2')

library(...
d
Replication Data for Exploring an extinct society through the lens of...
dataone.org
Updated Dec 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wieczorek, Oliver; Malzahn, Melanie (2023). Replication Data for Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus [Dataset]. http://doi.org/10.7910/DVN/UF8DHK
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/UF8DHK
Dataset updated
Dec 16, 2023
Dataset provided by
Harvard Dataverse
Authors
Wieczorek, Oliver; Malzahn, Melanie
Description
The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". This study aimed at utilizing the CEToM-corpus (https://cetom.univie.ac.at/) (Tocharian) to analyze the life-world of the elites of an extinct society situated in modern eastern China. To acquire the raw data needed for steps 1 & 2, please contact Melanie Malzahn melanie.malzahn@univie.ac.at. We conducted a mixed methods study, containing of close reading, content analysis, and multiple correspondence analysis (MCA). The excel file titled "fragments_architecture_combined.xlsx" allows for replication of the MCA and equates to the third step of the workflow outlined below. We used the following programming languages and packages to prepare the dataset and to analyze the data. Data preparation and merging procedures were achieved in python (version 3.9.10) with packages pandas (version 1.5.3), os (version 3.12.0), re (version 3.12.0), numpy (version 1.24.3), gensim (version 4.3.1), BeautifulSoup4 (version 4.12.2), pyasn1 (version 0.4.8), and langdetect (version 1.0.9). Multiple Correspondence Analyses were conducted in R (version 4.3.2) with the packages FactoMineR (version 2.9), factoextra (version 1.0.7), readxl version(1.4.3), tidyverse version(2.0.0), ggplot2 (version 3.4.4) and psych (version 2.3.9). After requesting the necessary files, please open the scripts in the order outlined bellow and execute the code-files to replicate the analysis: Preparatory step: Create a folder for the python and r-scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script will generate you the following folders: "tarim-brahmi_database" = Folder, which contains tocharian dictionaries and tocharian text fragments. "dictionaries" = contains tocharian A and tocharian B vocabularies, including linguistic features such as translations, meanings, part of speech tags etc. A full overview of the words is provided on https://cetom.univie.ac.at/?words. "fragments" = contains tocharian text fragments as xml-files. "word_corpus_data" = folder will contain excel-files of the corpus data after the first step. "Architectural_terms" = This folder contains the data on the architectural terms used in the dataset (e.g. dwelling, house). "regional_data" = This folder contains the data on the findsports (tocharian and modern chinese equivalent, e.g. Duldur-Akhur & Kucha). "mca_ready_data" = This is the folder, in which the excel-file with the merged data will be saved. Note that the prepared file named "fragments_architecture_combined.xlsx" can be saved into this directory. This allows you to skip steps 1 &2 and reproduce the MCA of the content analysis based on the third step of our workflow (R-Script titled 3_conduct_MCA.R). First step - run 1_read_xml-files.py: loops over the xml-files in folder dictionaries and identifies a) word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. Then, it loops over the xml-textfiles and extracts a text id number, langauge (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, material on which the text is written, medium, findspot, the source text in tocharian, and the translation where available. After successful feature extraction, the resulting pandas dataframe object is exported to the word_corpus_data folder. Second step - run 2_merge_excel_files.py: merges all excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was based upon close reading in the first place. Third step - run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA. Then produces the descriptive values, before conducitng the MCA, identifying typical texts per dimension, and exporting the png-files uploaded to this repository.

EEA Air Quality In-Situ Measurement Station Data

zenodo.org

bin, zip

Updated Dec 5, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Johannes Heisig; Johannes Heisig (2024). EEA Air Quality In-Situ Measurement Station Data [Dataset]. http://doi.org/10.5281/zenodo.13220430

Explore at:

zip, binAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.13220430

Dataset updated

Dec 5, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Johannes Heisig; Johannes Heisig

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Introduction

This dataset is a value-added product based on 'Up-to-date air quality station measurements', administered by the European Environmental Agency (EEA) and collected by its member states. The original hourly measurement data (NO2, SO2, O3, PM10, PM2.5 in µg/m³) was reshaped, gapfilled and aggregated to different temporal resolutions, making it ready to use in time series analysis or spatial interpolation tasks.

Reproducible code for accessing and processing this data and notebooks for demonstration can be found on Github.

Accessing and pre-processing hourly data

Hourly data was retrieved through the API of the EEA Air Quality Download Service. Measurements (single files per station and pollutant) were joined to create a single time series per station with observations for multiple pollutants. As PM2.5 data is sparse but correlates well with PM10, gapfilling was performed according to methods described in Horálek et al., 2023¹. Validity and verification flags from the original data were passed on for quality filtering. Reproducible computational notebooks using the R programming language are available for the data access and the gapfilling procedure.

Temporal aggregates

Data was aggregated to three coarser temporal resolutions: day, month, and year. Coverage (ratio of non-missing value) was calculated for each pollutant and temporal increment. A threshold of 75% was applied to generate reliable aggregates. All pollutants were aggregated by their aritmethic mean. Additionally, two pollutants were aggregated using a percentile method, which has shown to be more appropriate for mapping applications. PM10 was summarized using the 90.41th percentile. Daily O3 was further summarized as the maximum of the 8-hour running mean. Based thereon, monthly and annual O3 was aggregated using the 93.15th percentile of the daily maxima. For more details refer to the reproducible computational notebook on temporal aggregation.

Data columns

column	hourly	daily	monthly	annual	description
Air.Quality.Station.EoI.Code	x	x	x	x	Unique station ID
Countrycode	x	x	x	x	Two-letter ISO country code
Start	x				Start time of (hourly) measurement period
	x	x	x	x	One of NO2; SO2; O3; O3_max8h_93.15; PM10; PM10_90.41; PM2.5 in µg/m³
Validity_	x				Validity flag of the respective pollutant
Verification_	x				Verification flag of the respective pollutant
filled_PM2.5	x				Flag indicating if PM2.5 value is measured or supplemented through gapfilling (boolean)
year		x	x	x	Year (2015-2023)
cov.year_		x		x	Data coverage throughout the year (0-1)
month		x	x		Month (1-12)
cov.month_		x	x		Data coverage throughout the month (0-1)
doy		x			Day of year (0-366)
cov.day_		x			Data coverage throughout the day (0-1)

Station meta data

To avoid redundant information and optimize file size, some relevant meta data is not stored in the air quality data tables, but rather seperately (in a file named "EEA_stations_meta_table.parquet"). This includes type and area of measurement stations, as well as their coordinates.

column	description
Air.Quality.Station.EoI.Code	Unique station ID (required for join)
Countrycode	Two-letter ISO country code
Station.Type	One of "background", "industrial", or "traffic"
Station.Area	One of "urban", "suburban", "rural", "rural-nearcity", "rural-regional", "rural-remote"
Longitude & Latitude	Geographic coordinates of the station

Parquet file format

This dataset is shipped in [Parquet files. Hourly and aggregated data are distributed in four individual datasets. Daily and hourly data are partitioned by `Countrycode` (one file per country) to enable reading smaller subsets. Monthly and annual data files are small (> 20Mb) and stored in a single file each. Parquet is a relatively new and very memory-efficient format, that differs from traditional tabular file formats (e.g. CSV) in the sense that it is binary and cannot be opened and displayed by common tabular software (e.g. MS Excel, Libre Office, etc.). Users rather have to use an Apache Arrow implementation, for example in Python, R, C++, or another scripting language. Reading the data there is straight forward (click to see the code samples below).

R code:

# required libraries
library(arrow)
library(dplyr)

# read air quality and meta data
aq = read_parquet("airquality.no2.o3.so2.pm10.pm2p5_4.annual_pnt_20150101_20231231_eu_epsg.3035_v20240718.parquet")
meta = read_parquet("EEA_stations_meta_table.parquet")

# join the two for further analysis
aq_meta = inner_join(aq, meta, by = join_by(Air.Quality.Station.EoI.Code))

Python code:

# required libraries
import pandas as pd

# read air quality and meta data
aq = pd.read_parquet("airquality.no2.o3.so2.pm10.pm2p5_4.annual_pnt_20150101_20231231_eu_epsg.3035_v20240718.parquet")
meta = pd.read_parquet("EEA_stations_meta_table.parquet")

# join the two for further analysis
aq_meta = aq.merge(meta,on = ["Air.Quality.Station.EoI.Code", "Countrycode"])

Market Basket Analysis
kaggle.com
zip
Updated Dec 9, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Explore at:
zip(23875170 bytes)Available download formats
Dataset updated
Dec 9, 2021
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions on itemset that a customer is most likely to purchase .I was given dataset contains data of a retailer; the transaction data provides data around all the transactions that have happened over a period of time. Retailer will use result to grove in his industry and provide for customer suggestions on itemset, we be able increase customer engagement and improve customer experience and identify customer behavior. I will solve this problem with use Association Rules type of unsupervised learning technique that checks for the dependency of one data item on another data item.

Introduction

Association Rule is most used when you are planning to build association in different objects in a set. It works when you are planning to find frequent patterns in a transaction database. It can tell you what items do customers frequently buy together and it allows retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers, 10 of them bought Computer Mouth, 9 bought Mat for Mouse and 8 bought both of them. - bought Computer Mouth => bought Mat for Mouse - support = P(Mouth & Mat) = 8/100 = 0.08 - confidence = support/P(Mat for Mouse) = 0.08/0.09 = 0.89 - lift = confidence/P(Computer Mouth) = 0.89/0.10 = 8.9 This just simple example. In practice, a rule needs the support of several hundred transactions, before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of Rule

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: . xlsx

Number of Row: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.

https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png">

Libraries in R

First, we need to load required libraries. Shortly I describe all libraries.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization. techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png">

Data Pre-processing

Next, we need to upload Assignment-1_Data. xlsx to R to read the dataset.Now we can see our data in R.

https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png"> https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png">

After we will clear our data frame, will remove missing values.

https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png">

To apply Association Rule mining, we need to convert dataframe into transaction data to make all items that are bought together in one invoice will be in ...
All Lending Club loan data
kaggle.com
zip
Updated Apr 10, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nathan George (2019). All Lending Club loan data [Dataset]. https://www.kaggle.com/wordsforthewise/lending-club
Explore at:
zip(1356507910 bytes)Available download formats
Dataset updated
Apr 10, 2019
Authors
Nathan George
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Update: I probably won't be able to update the data anymore, as LendingClub now has a scary 'TOS' popup when downloading the data. Worst case, they will ask me/Kaggle to take it down from here.

This dataset contains the full LendingClub data available from their site. There are separate files for accepted and rejected loans. The accepted loans also include the FICO scores, which can only be downloaded when you are signed in to LendingClub and download the data.

See the Python and R getting started kernels to get started:

R: https://www.kaggle.com/wordsforthewise/eda-in-r-arggghh

Python: https://www.kaggle.com/wordsforthewise/eda-with-python

I created a git repo for the code which is used to create this data: https://github.com/nateGeorge/preprocess_lending_club_data

Background

I wanted an easy way to share all the lending club data with others. Unfortunately, the data on their site is fragmented into many smaller files. There is another lending club dataset on Kaggle, but it wasn't updated in years. It seems like the "Kaggle Team" is updating it now. I think it also doesn't include the full rejected loans, which are included here. It seems like the other dataset confusingly has some of the rejected loans mixed into the accepted ones. Now there are a ton of other LendingClub datasets on here too, most of which seem to have no documentation or explanation of what the data actually is.

Content

The definitions for the fields are on the LendingClub site, at the bottom of the page. Kaggle won't let me upload the .xlsx file for some reason since it seems to be in multiple other data repos. This file seems to be in the other main repo, but again, it's better to get it directly from the source.

Unfortunately, there is (maybe "was" now?) a limit of 500MB for dataset files, so I had to compress the files with gzip in the Python pandas package.

I cleaned the data a tiny bit: I removed percent symbols (%) from int_rate and revol_util columns in the accepted loans and converted those columns to floats.

Update

The URL column is in the dataset for completeness, as of 2018 Q2.
r
Respiration_chambers/raw_log_files and combined datasets of biomass and...
researchdata.edu.au
data.aad.gov.au
Updated Dec 3, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BLACK, JAMES GEOFFREY; Black, J.G.; BLACK, JAMES GEOFFREY; BLACK, JAMES GEOFFREY (2018). Respiration_chambers/raw_log_files and combined datasets of biomass and chamber data, and physical parameters [Dataset]. https://researchdata.edu.au/respirationchambersrawlogfiles-combined-datasets-physical-parameters/1360456
Explore at:
Dataset updated
Dec 3, 2018
Dataset provided by
Australian Antarctic Data Centre
Australian Antarctic Division
Authors
BLACK, JAMES GEOFFREY; Black, J.G.; BLACK, JAMES GEOFFREY; BLACK, JAMES GEOFFREY
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 27, 2015 - Feb 23, 2015
Area covered

Description
General overview
The following datasets are described by this metadata record, and are available for download from the provided URL.

- Raw log files, physical parameters raw log files
- Raw excel files, respiration/PAM chamber raw excel spreadsheets
- Processed and cleaned excel files, respiration chamber biomass data
- Raw rapid light curve excel files (this is duplicated from Raw log files), combined dataset pH, temperature, oxygen, salinity, velocity for experiment
- Associated R script file for pump cycles of respirations chambers

####

Physical parameters raw log files

Raw log files
1) DATE=
2) Time= UTC+11
3) PROG=Automated program to control sensors and collect data
4) BAT=Amount of battery remaining
5) STEP=check aquation manual
6) SPIES=check aquation manual
7) PAR=Photoactive radiation
8) Levels=check aquation manual
9) Pumps= program for pumps
10) WQM=check aquation manual

####

Respiration/PAM chamber raw excel spreadsheets

Abbreviations in headers of datasets
Note: Two data sets are provided in different formats. Raw and cleaned (adj). These are the same data with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) dataframe will work with the R syntax below, alternative add code to do cleaning in R.

Date: ISO 1986 - Check
Time:UTC+11 unless otherwise stated
DATETIME: UTC+11 unless otherwise stated
ID (of instrument in respiration chambers)
ID43=Pulse amplitude fluoresence measurement of control
ID44=Pulse amplitude fluoresence measurement of acidified chamber
ID=1 Dissolved oxygen
ID=2 Dissolved oxygen
ID3= PAR
ID4= PAR
PAR=Photo active radiation umols
F0=minimal florescence from PAM
Fm=Maximum fluorescence from PAM
Yield=(F0 – Fm)/Fm
rChl=an estimate of chlorophyll (Note this is uncalibrated and is an estimate only)
Temp=Temperature degrees C
PAR=Photo active radiation
PAR2= Photo active radiation2
DO=Dissolved oxygen
%Sat= Saturation of dissolved oxygen
Notes=This is the program of the underwater submersible logger with the following abreviations:
Notes-1) PAM=
Notes-2) PAM=Gain level set (see aquation manual for more detail)
Notes-3) Acclimatisation= Program of slowly introducing treatment water into chamber
Notes-4) Shutter start up 2 sensors+sample…= Shutter PAMs automatic set up procedure (see aquation manual)
Notes-5) Yield step 2=PAM yield measurement and calculation of control
Notes-6) Yield step 5= PAM yield measurement and calculation of acidified
Notes-7) Abatus respiration DO and PAR step 1= Program to measure dissolved oxygen and PAR (see aquation manual). Steps 1-4 are different stages of this program including pump cycles, DO and PAR measurements.

8) Rapid light curve data
Pre LC: A yield measurement prior to the following measurement
After 10.0 sec at 0.5% to 8%: Level of each of the 8 steps of the rapid light curve
Odessey PAR (only in some deployments): An extra measure of PAR (umols) using an Odessey data logger
Dataflow PAR: An extra measure of PAR (umols) using a Dataflow sensor.
PAM PAR: This is copied from the PAR or PAR2 column
PAR all: This is the complete PAR file and should be used
Deployment: Identifying which deployment the data came from

####

Respiration chamber biomass data

The data is chlorophyll a biomass from cores from the respiration chambers. The headers are: Depth (mm) Treat (Acidified or control) Chl a (pigment and indicator of biomass) Core (5 cores were collected from each chamber, three were analysed for chl a), these are psudoreplicates/subsamples from the chambers and should not be treated as replicates.

####

Associated R script file for pump cycles of respirations chambers

Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to chambers. Determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the respiration chambers are delivering water as this will give incorrect results. The headers that get used in the attached/associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.

To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180 minute data blocks were determined. R squared values for fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions with calculation of net production rates: the first is that heterotrophic community members do not change their metabolism under OA; and the second is that the heterotrophic communities are similar between treatments.

####

Combined dataset pH, temperature, oxygen, salinity, velocity for experiment

This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).

The headers are
PAR: Photoactive radiation
relETR: F0/Fm x PAR
Notes: Stage/step of light curve
Treatment: Acidified or control

The associated light treatments in each stage. Each actinic light intensity is held for 10 seconds, then a saturating pulse is taken (see PAM methods).

After 10.0 sec at 0.5% = 1 umols PAR
After 10.0 sec at 0.7% = 1 umols PAR
After 10.0 sec at 1.1% = 0.96 umols PAR
After 10.0 sec at 1.6% = 4.32 umols PAR
After 10.0 sec at 2.4% = 4.32 umols PAR
After 10.0 sec at 3.6% = 8.31 umols PAR
After 10.0 sec at 5.3% =15.78 umols PAR
After 10.0 sec at 8.0% = 25.75 umols PAR

This dataset appears to be missing data, note D5 rows potentially not useable information

See the word document in the download file for more information.
Comprehensive Literary Greats Dataset
kaggle.com
zip
Updated Jan 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Comprehensive Literary Greats Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/comprehensive-literary-greats-dataset
Explore at:
zip(29940528 bytes)Available download formats
Dataset updated
Jan 29, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Comprehensive Literary Greats Dataset

50,000+ Books Rated and Awarded Across Language, Genre, and Format

By [source]

About this dataset

This remarkable dataset provides an awe-inspiring collection of over 50,000 books, encompassing the world's best practices in literature, poetry, and authorship. For each book included in the dataset, users can gain access to a wealth of insightful information such as title, author(s), average rating given by readers and critics alike, a brief description highlighting its plot or characteristics; language it is written in; unique ISBN which enables potential buyers to locate their favorite works with ease; genres it belongs to; any awards it has won or characters that inhabit its storyworld.

Additionally, seeking out readers' opinions on exceptional books is made easier due to the availability of bbeScore (best books ever score) alongside details for the most accurate ratings given through well-detailed breakdowns in “ratingsByStars” section. Making sure visibility and recognition are granted fairly – be it a classic novel from time immemorial or merely recently released newcomers - this source also allows us to evaluate new stories based off readers' engagement rate highlighted by likedPercent column (the percentage of readers who liked the book), bbeVotes (number of votes casted) as well as entries related to date published - including showstopping firstPublishDate!

Aspiring literature researchers; literary historians and those seeking hidden literary gems alike would no doubt benefit from delving into this magnificent collection – 25 variables regarding different novels & poets that are presented by Kaggle open source dataset “Best Books Ever: A Comprehensive Historical Collection of Literary Greats”. What worlds awaits you?

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

Whether you are a student, researcher, or enthusiast of literature, this dataset provides a valuable source for exploring literary works from varied time periods and genres. By accessing all 25 variables in the dataset, readers have the opportunity to use them for building visualizations, creating new analysis tools and models, or finding books you might be interested in reading.

First after downloading the dataset into Kaggle Notebooks platform or other programming interfaces of your choice such as R Studio/Python Jupyter Notebooks (Pandas) - make sure that data is arranged into columns with clearly labeled title names. This will help you understand which variable is related to what precise information. Afterwards explore each variable by finding any patterns across particular titles or interesting findings about certain authors/ratings that are available in your research interests.

Utilize the vital columns of Title (title), Author(author), Rating (rating), Description (description), Language (language), Genres (genres) and Characters(characters) - these can assist you in discovering different trends between books according to style of composition or character types etc. Move further down on examining more specific details offered by Book Format(bookFormat), Edition(edition) Pages(pages). Peruse publisher info along with Publish Date(publishDate). Besides these structural elements also take note of Awards column considering recent recognition different titles have received; also observe how much ratings has been collected per text through Numbers Ratings column-(numRatings); analyze reader's feedback according on Ratings By Stars(_ratingsByStars); view LikedPercentage rate provided by readers when analyzing particular book(_likedPercent).

Apart from more accessible factors mentioned previously delve deeper onto more sophisticated data presented: Setting (_setting); Cover Image (_coverImg); BbeScore_bbeScore); BbeVotes_bbeVotes). All those should provide greater insight when trying to explain why certain book has made its way onto GoodReads top selections list! To find value estimate test out Price (_price)) column too - determining if some texts retain large popularity despite rather costly publishing options cost-wise available on market currently?

Finally combine different aspects observed while researching concerning individual titles- create personalized recommendations based upon released comprehensive lists! To achieve that utilize ISUBN code provided; compare publication Vs first publication dates historically recorded; verify awards labeling procedure relied upon give context information on discussed here books progress over years

Research Ideas

Creating a web or mobile...
f
Data used in "A summer heatwave reduced activity, heart rate and autumn body...
datasetcatalog.nlm.nih.gov
figshare.com
Updated Mar 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Evans, Alina L.; Albon, Steve; Król, Elżbieta; Trondrud, L. Monica; Kumpula, Jouko; Speakman, john; Loe, Leif Egil; Pigeon, Gabriel; Ropstad, Erik (2023). Data used in "A summer heatwave reduced activity, heart rate and autumn body mass in a cold-adapted ungulate" [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001093609
Explore at:
Dataset updated
Mar 31, 2023
Authors
Evans, Alina L.; Albon, Steve; Król, Elżbieta; Trondrud, L. Monica; Kumpula, Jouko; Speakman, john; Loe, Leif Egil; Pigeon, Gabriel; Ropstad, Erik
Description
Overview This dataset contains biologging data and R script used to produce the results in "A summer heatwave reduced activity, heart rate and autumn body mass in a cold-adapted ungulate", a submitted manuscript. The longitudinal data of female reindeer and calf body masses used in the paper is owned by the Finnish Reindeer Herders’ Association. Natural Resources Institute Finland (Luke) updates, saves and administrates this long-term reindeer herd data. Methods of data collection Animals and study area The study involved biologging (see below) 14 adult semi-domesticated reindeer females (Focal animals: Table S1) at the Kutuharju Reindeer Research Facility (Kaamanen, Northern Finland, 69° 8’ N, 26° 59’ E, Figure S1), during June–September 2018. Ten of these individuals had been intensively handled in June as part of another study (Trondrud, 2021). The 14 females were part of a herd of ~100 animals, belonging to the Reindeer Herders’ Association. The herding management included keeping reindeer in two large enclosures (~13.8 and ~15 km2) after calving until the rut, after which animals were moved to a winter enclosure (~15 km2) and then in spring to a calving paddock (~0.3 km2) to give birth (See Supporting Information for further details on the study area). Kutuharju reindeer graze freely on natural pastures from May to November and after that are provided with silage and pellets as a supplementary feed in winter. During the period from September to April animals are weighed 5–6 times. In September, body masses of the focal females did not differ from the rest of the herd. Heart rate (HR) and subcutaneous body temperature (Tsc) data In February 2018, the focal females were instrumented with a heart rate (HR) and temperature logger (DST centi-HRT, Star-Oddi, Gardabaer, Iceland). The surgical protocol is described in the Supporting Information. The DST centi-HRT sensors recorded HR and subcutaneous body temperature (Tsc) every 15 min. HR was automatically calculated from a 4-sec electrocardiogram (ECG) at 150 Hz measurement frequency, alongside an index for signal quality. Additional data processing is described in Supporting Information. Activity data The animals were fitted with collar-mounted tri-axial accelerometers (Vertex Plus Activity Sensor, Vectronic Aerospace GmbH, Berlin, Germany) to monitor their activity levels. These sensors recorded acceleration (g) in three directions representing back-forward, lateral, and dorsal-ventral movements at 8 Hz resolution. For each axis, partial dynamic body acceleration (PDBA) was calculated by subtracting the static acceleration using a 4 sec running average from the raw acceleration (Shepard et al., 2008). We estimated vectorial dynamic body acceleration (VeDBA) by calculating the square root of the sum of squared PDBAs (Wilson et al., 2020). We aggregated VeDBA data into 15-min sums (hereafter “sum VeDBA”) to match with HR and Tsc records. Corrections for time offsets are described in Supporting Information. Due to logger failures, only 10 of the 14 individuals had complete data from both loggers (activity and heart rate). Weather and climate data We set up a HOBO weather station (Onset Computer Corporation, Bourne, MA, USA) mounted on a 2 m tall tripod in May 2018 that measured air temperature (Ta, °C) at 15-minute intervals. The placement of the station was between the two summer paddocks. These measurements were matched to the nearest timestamps for VeDBA, HR and Tsc recordings. Also, we obtained weather records from the nearest public weather stations for the years 1990–2021 (Table S2). Weather station IDs and locations relative to the study area are shown in Figure S1 in the Supporting Information. The temperatures at the study site and the nearest weather station were strongly correlated (Pearson’s, r = 0.99), but temperatures were on average ~1.0°C higher at the study site (Figure S2). Statistical analyses All statistical analyses were conducted in R version 4.1.1 (The R Core Team, 2021). Mean values are presented with standard deviation (SD), and parameter estimates with standard error (SE). Environmental effects on activity states and transition probabilities We fitted hidden Markov models (HMM) to 15-min sum VeDBA using the package ‘momentuHMM’ (McClintock & Michelot, 2018). HMMs assume that the observed pattern is driven by an underlying latent state sequence (a finite Markov chain). These states can then be used as proxies to interpret the animal’s unobserved behaviour (Langrock et al., 2012). We assumed only two underlying states, thought to represent ‘inactive’ and ‘active’ (Figure S3). The ‘active’ state thus contains multiple forms of movement, e.g., foraging, walking, and running, but reindeer spend more than 50% of the time foraging in summer (Skogland, 1980). We fitted several HMMs to evaluate both external (temperature and time of day) and individual-level (calf status) effects on the probability to occupy each state (stationary state probabilities). The combination of the explanatory variables in each HMM is listed in Table S5. Ta was fitted as a continuous variable with piecewise polynomial spline with 8 knots, asserted from visual inspection of the model outputs. We included sine and cosine terms for time of day to account for cyclicity. In addition, to assess the impact of Ta on activity patterns, we fitted five temperature-day categories in interaction with time of day. These categories were based on 20% intervals of the distribution of temperature data from our local weather station, in the period 19 June to 19 August 2018, with ranges of < 10°C (cold), 10−13°C (cool), 13−16°C (intermediate) 16−20°C (warm) and ≥ 20°C (hot). We evaluated the significance of each variable on the transition probabilities from the confidence intervals of each estimate, and the goodness-of-fit of each model using Akaike information criteria (AIC) (Burnham & Anderson, 2002), retaining models within ΔAIC < 5. We extracted the most likely state occupied by an individual using the viterbi function, returning the optimal state pathway, i.e., a two-level categorical variable indicating whether the individual was most likely resting or active. We used this output to calculate daily activity budgets (% time spent active). Drivers of heart rate (HR) and subcutaneous body temperature (Tsc) We matched the activity states derived from the HMM to the HR and Tsc data. We opted to investigate the drivers of variation in HR and Tsc only within the inactive state. HR and Tsc were fitted as response variables in separate generalised additive mixed-effects models (GAMM), which included the following smooth terms: calendar day as a thin-plate regression spline, time of day (ToD, in hours, knots [k] = 10) as a cubic circular regression spline and individual as random intercept. All models were fitted using restricted maximum likelihood, a penalization value (λ) of 1.4 (Wood, 2017), and an autoregressive structure (AR1) to account for temporal autocorrelation. We used the ‘gam.check’ function from the ‘mgcv’ package to select k. The sum of VeDBA in the past 15 minutes was included as a predictor in all models. All models were fitted with the same set of explanatory variables: sum VeDBA, age, body mass (BM), lactation status, Ta, as well as the interaction between lactation status and Ta. Description of files 1. Data: "kutuharju_weather.csv" weather data recorded from local weather station during study period "Inari_Ivalo_lentoasema.csv" public weather data from weather station ID 102033, owned and managed by the Finnish Meterorological Institute "activitydata.Rdata" dataset used in analyses of activity patterns in reindeer "HR_temp_data.Rdata" dataset used in analyses of heart rate and body temperature responses in reindeer "HRfigureData.Rdata" and "TempFigureData.Rdata" are data files (lists) with model outputs generated in "heartrate_bodytemp_analyses.R" and used in "figures_in_paper.R" "HMM_df_withStates.Rdata" data frame used in HMM models including output from viterbi function "plotdf_m16.Rdata" dataframe for plotting output from model 16 "plotdf_m22.Rdata" dataframe for plotting output from model 22 2. Scripts "activitydata_HMMs.R" R script for data prep and hidden markov models to analyse activity patterns in reindeer "heartrate_bodytemp_analyses.R" R script for data prep and generalized additive mixed models to analyse heart rate and body temperature responses in reindeer "figures_in_paper.R" R script for generating figures 1-3 in the manuscript 3. HMM_model "modelList.Rdata" list containing 2 items: string of all 25 HMM models created, and dataframe with model number and formula "m16.Rdata" and "m22.Rdata" direct acces to two best-fit models
Global map of tree density
figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crowther, T. W.; Glick, H. B.; Covey, K. R.; Bettigole, C.; Maynard, D. S.; Thomas, S. M.; Smith, J. R.; Hintler, G.; Duguid, M. C.; Amatulli, G.; Tuanmu, M. N.; Jetz, W.; Salas, C.; Stam, C.; Piotto, D.; Tavani, R.; Green, S.; Bruce, G.; Williams, S. J.; Wiser, S. K.; Huber, M. O.; Hengeveld, G. M.; Nabuurs, G. J.; Tikhonova, E.; Borchardt, P.; Li, C. F.; Powrie, L. W.; Fischer, M.; Hemp, A.; Homeier, J.; Cho, P.; Vibrans, A. C.; Umunay, P. M.; Piao, S. L.; Rowe, C. W.; Ashton, M. S.; Crane, P. R.; Bradford, M. A. (2023). Global map of tree density [Dataset]. http://doi.org/10.6084/m9.figshare.3179986.v2
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.3179986.v2
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Crowther, T. W.; Glick, H. B.; Covey, K. R.; Bettigole, C.; Maynard, D. S.; Thomas, S. M.; Smith, J. R.; Hintler, G.; Duguid, M. C.; Amatulli, G.; Tuanmu, M. N.; Jetz, W.; Salas, C.; Stam, C.; Piotto, D.; Tavani, R.; Green, S.; Bruce, G.; Williams, S. J.; Wiser, S. K.; Huber, M. O.; Hengeveld, G. M.; Nabuurs, G. J.; Tikhonova, E.; Borchardt, P.; Li, C. F.; Powrie, L. W.; Fischer, M.; Hemp, A.; Homeier, J.; Cho, P.; Vibrans, A. C.; Umunay, P. M.; Piao, S. L.; Rowe, C. W.; Ashton, M. S.; Crane, P. R.; Bradford, M. A.
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Crowther_Nature_Files.zip This description pertains to the original download. Details on revised (newer) versions of the datasets are listed below. When more than one version of a file exists in Figshare, the original DOI will take users to the latest version, though each version technically has its own DOI. -- Two global maps (raster files) of tree density. These maps highlight how the number of trees varies across the world. One map was generated using biome-level models of tree density, and applied at the biome scale. The other map was generated using ecoregion-level models of tree density, and applied at the ecoregion scale. For this reason, transitions between biomes or between ecoregions may be unrealistically harsh, but large-scale estimates are robust (see Crowther et al 2015 and Glick et al 2016). At the outset, this study was intended to generate reliable estimates at broad spatial scales, which inherently comes at the cost of fine-scale precision. For this reason, country-scale (or larger) estimates are generally more robust than individual pixel-level estimates. Additionally, due to data limitations, estimates for Mangroves and Tropical coniferous forest (as identified by WWF and TNC) were generated using models constructed from Topical moist broadleaf forest data and Temperate coniferous forest data, respectively. Because we used ecological analogy, the estimates for these two biomes should be considered less reliable than those of other biomes . These two maps initially appeared in Crowther et al (2015), with the biome map being featured more prominently. Explicit publication of the data is associated with Glick et al (2016). As they are produced, updated versions of these datasets, as well as alternative formats, will be made available under Additional Versions (see below).

Methods: We collected over 420,000 ground-sources estimates of tree density from around the world. We then constructed linear regression models using vegetative, climatic, topographic, and anthropogenic variables to produce forest tree density estimates for all locations globally. All modeling was done in R. Mapping was done using R and ArcGIS 10.1.

Viewing Instructions: Load the files into an appropriate geographic information system (GIS). For the original download (ArcGIS geodatabase files), load the files into ArcGIS to view or export the data to other formats. Because these datasets are large and have a unique coordinate system that is not read by many GIS, we suggest loading them into an ArcGIS dataframe whose coordinate system matches that of the data (see File Format). For GeoTiff files (see Additional Versions), load them into any compatible GIS or image management program.

Comments: The original download provides a zipped folder that contains (1) an ArcGIS File Geodatabase (.gdb) containing one raster file for each of the two global models of tree density – one based on biomes and one based on ecoregions; (2) a layer file (.lyr) for each of the global models with the symbology used for each respective model in Crowther et al (2015); and an ArcGIS Map Document (.mxd) that contains the layers and symbology for each map in the paper. The data is delivered in the Goode homolosine interrupted projected coordinate system that was used to compute biome, ecoregion, and global estimates of the number and density of trees presented in Crowther et al (2015). To obtain maps like those presented in the official publication, raster files will need to be reprojected to the Eckert III projected coordinate system. Details on subsequent revisions and alternative file formats are list below under Additional Versions.----------

Additional Versions: Crowther_Nature_Files_Revision_01.zip contains tree density predictions for small islands that are not included in the data available in the original dataset. These predictions were not taken into consideration in production of maps and figures presented in Crowther et al (2015), with the exception of the values presented in Supplemental Table 2. The file structure follows that of the original data and includes both biome- and ecoregion-level models.

Crowther_Nature_Files_Revision_01_WGS84_GeoTiff.zip contains Revision_01 of the biome-level model, but stored in WGS84 and GeoTiff format. This file was produced by reprojecting the original Goode homolosine files to WGS84 using nearest neighbor resampling in ArcMap. All areal computations presented in the manuscript were computed using the Goode homolosine projection. This means that comparable computations made with projected versions of this WGS84 data are likely to differ (substantially at greater latitudes) as a product of the resampling. Included in this .zip file are the primary .tif and its visualization support files.

References:

Crowther, T. W., Glick, H. B., Covey, K. R., Bettigole, C., Maynard, D. S., Thomas, S. M., Smith, J. R., Hintler, G., Duguid, M. C., Amatulli, G., Tuanmu, M. N., Jetz, W., Salas, C., Stam, C., Piotto, D., Tavani, R., Green, S., Bruce, G., Williams, S. J., Wiser, S. K., Huber, M. O., Hengeveld, G. M., Nabuurs, G. J., Tikhonova, E., Borchardt, P., Li, C. F., Powrie, L. W., Fischer, M., Hemp, A., Homeier, J., Cho, P., Vibrans, A. C., Umunay, P. M., Piao, S. L., Rowe, C. W., Ashton, M. S., Crane, P. R., and Bradford, M. A. 2015. Mapping tree density at a global scale. Nature, 525(7568): 201-205. DOI: http://doi.org/10.1038/nature14967Glick, H. B., Bettigole, C. B., Maynard, D. S., Covey, K. R., Smith, J. R., and Crowther, T. W. 2016. Spatially explicit models of global tree density. Scientific Data, 3(160069), doi:10.1038/sdata.2016.69.
Data Carpentry Genomics Curriculum Example Data
figshare.com
datasetcatalog.nlm.nih.gov
application/gzip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Olivier Tenaillon; Jeffrey E Barrick; Noah Ribeck; Daniel E. Deatherage; Jeffrey L. Blanchard; Aurko Dasgupta; Gabriel C. Wu; Sébastien Wielgoss; Stéphane Cruvellier; Claudine Medigue; Dominique Schneider; Richard E. Lenski; Taylor Reiter; Jessica Mizzi; Fotis Psomopoulos; Ryan Peek; Jason Williams (2023). Data Carpentry Genomics Curriculum Example Data [Dataset]. http://doi.org/10.6084/m9.figshare.7726454.v3
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.7726454.v3
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Olivier Tenaillon; Jeffrey E Barrick; Noah Ribeck; Daniel E. Deatherage; Jeffrey L. Blanchard; Aurko Dasgupta; Gabriel C. Wu; Sébastien Wielgoss; Stéphane Cruvellier; Claudine Medigue; Dominique Schneider; Richard E. Lenski; Taylor Reiter; Jessica Mizzi; Fotis Psomopoulos; Ryan Peek; Jason Williams
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 16.0px 'Andale Mono'; color: #29f914; background-color: #000000} span.s1 {font-variant-ligatures: no-common-ligatures} These files are intended for use with the Data Carpentry Genomics curriculum (https://datacarpentry.org/genomics-workshop/). Files will be useful for instructors teaching this curriculum in a workshop setting, as well as individuals working through these materials on their own.

This curriculum is normally taught using Amazon Web Services (AWS). Data Carpentry maintains an AWS image that includes all of the data files needed to use these lesson materials. For information on how to set up an AWS instance from that image, see https://datacarpentry.org/genomics-workshop/setup.html. Learners and instructors who would prefer to teach on a different remote computing system can access all required files from this FigShare dataset.

This curriculum uses data from a long term evolution experiment published in 2016: Tempo and mode of genome evolution in a 50,000-generation experiment (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/) by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959). All sequencing data sets are available in the NCBI BioProject database under accession number PRJNA294072 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA294072).

backup.tar.gz: contains original fastq files, reference genome, and subsampled fastq files. Directions for obtaining these files from public databases are given during the lesson https://datacarpentry.org/wrangling-genomics/02-quality-control/index.html). On the AWS image, these files are stored in ~/.backup directory. 1.3Gb in size.

Ecoli_metadata.xlsx: an example Excel file to be loaded during the R lesson.

shell_data.tar.gz: contains the files used as input to the Introduction to the Command Line for Genomics lesson (https://datacarpentry.org/shell-genomics/).

sub.tar.gz: contains subsampled fastq files that are used as input to the Data Wrangling and Processing for Genomics lesson (https://datacarpentry.org/wrangling-genomics/). 109Mb in size.

solutions: contains the output files of the Shell Genomics and Wrangling Genomics lessons, including fastqc output, sam, bam, bcf, and vcf files.

vcf_clean_script.R: converts vcf output in .solutions/wrangling_solutions/variant_calling_auto to single tidy data frame.

combined_tidy_vcf.csv: output of vcf_clean_script.R
Reddit AskScience Flair Analysis Dataset
kaggle.com
zip
Updated Feb 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2025). Reddit AskScience Flair Analysis Dataset [Dataset]. https://www.kaggle.com/sumitm004/reddit-raskscience-flair-dataset
Explore at:
zip(47363279 bytes)Available download formats
Dataset updated
Feb 15, 2025
Authors
Sumit Mishra
License
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
Context

Reddit is a massive platform for news, content, and discussions, hosting millions of active users daily. Among its vast number of subreddits, we focus on the r/AskScience community, where users engage in science-related discussions and questions.

Content

This dataset is derived from the r/AskScience subreddit, collected between January 1, 2016, and May 20, 2022. It includes 612,668 datapoints across 22 columns, featuring diverse information such as the content of the questions, submission descriptions, associated flairs, NSFW/SFW status, year of submission, and more. The data was extracted using Python and Pushshift's API, followed by some cleaning with NumPy and pandas. Detailed column descriptions are available for clarity.

Mendeley Data

Ideas for Usage

Flair Prediction:Train models to predict post flairs (e.g., 'Science', 'Ask', 'Discussion') to automate content categorization for platforms like Reddit.

NSFW Classification: Classify posts as SFW or NSFW based on textual content, enabling content moderation tools for online forums.

Text Mining / NLP Tasks: Apply NLP techniques like Sentiment Analysis, Topic Modeling, and Text Classification to explore the content and themes of science-related discussions.

Community Engagement Analysis: Investigate which post types or flairs generate more engagement (e.g., upvotes or comments), offering insights into user interaction.

Trend Detection in Science Topics: Identify emerging science topics and analyze shifts in interest areas, which can help predict future trends in scientific discussions.
Z
Supplementary information for "Reassessment of French breeding bird...
datasetcatalog.nlm.nih.gov
Updated May 31, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lorrilliere, Romain; Barbaro, Luc; Dupuy, Jérémy; Vallé, Clément; Nabias, Jean; Fontaine, Benoît; Couzi, Laurent (2024). Supplementary information for "Reassessment of French breeding bird population sizes using citizen science and accounting for species detectability" [Dataset]. http://doi.org/10.5281/zenodo.11402818
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.11402818
Dataset updated
May 31, 2024
Authors
Lorrilliere, Romain; Barbaro, Luc; Dupuy, Jérémy; Vallé, Clément; Nabias, Jean; Fontaine, Benoît; Couzi, Laurent
Area covered
French
Description
Reproducibility data for the manuscript "Reassessment of French breeding bird population sizes using citizen science and accounting for species detectability", it contains data and script for : The R script 01_HDSfreq_Calibration.R of the developed approach to estimate national breeding bird population size using Hierarchical Distance Sampling (HDS) and the secondary candidate set model selection method (Morin et al., 2020) The R script 02_pglmm_figures.R for the calibration of the Phylogenetic Generalised Mixed Model (PGLMM) used in the manuscript to compare previous population size estimates to ones modelled using 01_HDSfreq_Calibration.R, while accounting for species phylogenetic relatedness The R script 03_results_tables.R, used to generate supplementary tables S2.1-3 and S6.1-2. Column names are highlighted in italics. Data description A. BirdPhylo_Burleigh_et_al.tre A phylogenetic tree from Burleigh et al., 2015. Phylogenetic distances are used as random effect for the PGLMM in script 02_pglmm_figures.R B. Conservation_status.txt A .txt file of the conservation status for France (Statut_FR) and Europe (Statut_EU) for the studied species retrieved from (UICN France et al., 2016). Only Statut_FR is used for the table S6.1. C. FBBS_trends_20122023.txt A .txt file containing species trend of the French Breeding Bird Survey data from 2012 to 2023. Species names (English, French) associated with FBBS trend estimated using data collected from 2012 - 2023 Hab_specialization, determined from Julliard et al. 2006 approach infPrec, supPerc, estimate, se, pval : Species trends over 2012-2023 period in % | lower and upper confidence intervals, mean, standard error and significance D. PrepData_HDS.RData A file containing .RData environment required to run 01_HDSfreq_Calibration.R script, it contains : ATLAS12 : A dataframe with breeding status information from 2012 breeding bird atlas (used to restrict model prediction grid, in regard of 2012 known breeding locations) ConcordTBL : A concordance table for species names (English, French and scientific notation) EPOC_ODF : observation dataset, each line corresponds to detected individuals UUID, Ref, ID_liste, ID, ID_place, Grid_10x10 : Columns used to identify observations, lists, sites, locations, 10x10 grids ID_species_Biolovision, Nom_espece, english_name, scientific_name : Species ID and names Date, Day, Month, Year, Julian_date, Obs_hour, Hour_list, Complete_checklist, Commentary, Project_name, Scheme, Observer, List_time, List_diversity, List_abundance : Lists and Observation related effort covariates and metadata X_Lambert93_m, Y_Lambert93_m : Observation locations in (crs = 2154) GPS_loc_observer : Logical, TRUE : location of observers corresponds to true GPS information ; FALSE : observer's location approximated as the barycenter of observations X_barycentre_L93, Y_barycentre_L93 : Observers location in (crs = 2154) Use_distance_sampling, Observation_distance_m, Distance_bin_logical, Distance_class_0_25, Distance_class_25_100, Distance_class_100_200, Distance_class_200_more : Distance sampling related informations Abudance_brut, Estimate, Number, Nb_male_identified, Nb_female_identified, Nb_juvenile_identified, Nb_grounded, Nb_flying, Nb_auditory, Nb_NA : Observation metadata, used in case of a priori filter over male detection. grid_pred_envvar : Prediction grid with environmental covariates, see appendix S3 of the manuscript, covering metropolitan France grid_pred.sf : corresponding sf object L93_10x10 : sf object corresponding to 10x10 grid used in 2012 atlas ObsVar_EPOCODF : dataframe specifying lists effort covariates OCCU_EPOC_ODF : Environmental covariate agregated over lists OCCU_EPOC_ODF_sites_envvar : Environmental covariate agregated over sites table.pheno : species table specifying related phenology filter E. ReadOutput_HDSfreq_comparison.csv A .csv table of species population size estimated using 2021-2023 EPOC-ODF data over areas determined as breeding in the 2012 atlas. Predictions were restrained over location known as breeding in 2012 for the sake of comparison. HDS_estimUnfenced_XXX : average pop. size estimated with confidence interval before prediction post-treatment (describe in fig 2. of the manuscript) HDS_estim_ExtrapolFence_XXX : average pop. size estimated with confidence interval after prediction post-treatment NB_data_calib : Number of observations (distance data, not sites) used for calibration MALE_FILTERING : (logical) indicating if female individuals could be detected in the same proportion of males during list recording. FALSE : we considered that estimated pop.size corresponded to the number of individuals leading to a division by 2 for the comparison with the previous atlas (in pairs). (cf . line 88-90 in 02_pglmm_figures.R) EcartDTF_filtrage_maleOnly : If MALE_FILTERING == T, proportion of the remaining data used for calibration after removal of list with individual tagged as female/juvenile (in %) Max_dist_breaks : Maximal distance for detection function, after right-side truncation of 5% Chat : Coefficient of overdisperion of the best model in the second candidate set MED_MEAN_Prob_Detect (.._SE) : weighted averaged median of intercept from the availability state from HDS models, weigthed AICc-wise MED_MEAN_Density : weighted averaged median of intercept from the abundance state from HDS models, weigthed AICc-wise Significant_phi/lambda : Categorial (Significant/Near/Not), are availability/abundance intercepts significatively different from 0 (significant : alpha = 0.05, near : alpha = 0.1) KeyFun_used : Key function used for distance sampling Mixtured_used : Mixture used in the abundance state for HDS F. ReadOutput_HDSfreq_comparison_20212022.csv A .csv table of species population size estimated using 2021-2022 EPOC-ODF data over areas determined as breeding in the 2012 atlas. Used for the robustness analysis of HDS estimated population size, see appendix S2 and table S2.2 of the manuscript. G. ReadOutput_HDSfreq_EstimMetropole.csv A .csv table of species population size estimated using 2021-2023 EPOC-ODF data over metropolitan France. HDS_estimUnfenced_XXX : average pop. size estimated with confidence interval before prediction post-treatment (describe in fig 2. of the manuscript) HDS_estim_ExtrapolFence_XXX : average pop. size estimated with confidence interval after prediction post-treatment MALE_FILTERING : (logical) indicating if female individuals could be detected in the same proportion of males during list recording. FALSE : we considered that estimated pop.size corresponded to the number of individuals leading to a division by 2 for conversion to pop. size in breeding pairs Chat : Coefficient of overdisperion of the best model in the second candidate set KeyFun_used : Key function used for distance sampling Mixtured_used : Mixture used in the abundance state for HDS H. TABLE_SpeciesFilters_and_2012Estimates.txt A .txt table containing species names (English, French and scientific notation), filters and 2012 French atlas pop. size estimates debut_jour : starting day of the month for phenology filter debut_mois : starting month for phenology filter fin_jour : ending day of the month for phenology filter fin_mois : ending month for phenology filter Estim_low/up_Atlas2012 : Lower and Upper interval of estimated pop. size in 2012 (number in breeding pairs) gregarious : logical (0,1) specifying if the species is considered gregarious during its breeding season I. sessionInfo_script_XX User R session information, obtained from sessionInfo() R function, used for running R script. Code A. 01_HDSfreq_Calibration.R R script showcasing data formatting and model calibration of the HDS based upon frequentist aproach from unmarked R package. For more details of the model calibration approach, see appendix S4 of the manuscript. B. 02_pglmm_figures.R Script for the calibration of the PGLMM and generation of figure 5 of the manuscript. C. 03_results_tables.R Script to generate tables depicted in Appendices S2 (S2.1-3) and S6 (S6.1-2) D. HDS_functions.R R script called in 01_HDSfreq_Calibration.R, contains 2 functions: Try_HDS() : Function implementing a try-catch permitting calibration of multiple species in a loop. Species with non convergent models are skipped sending a notification to the user R interface. Used in all sub-candidate sets (i.e. "null", "p", "phi", "lambda") When phase="ALL" corresponding to the second candidate set (i.e. ensemble of best model candidates, with delta_AIC <= 10, from previous sub-candidate sets), it permits the use of previous sub-candidates set coefficients as starting values, with StartValues argument Later part of the function hack the call of the unmarkedFit class, in order to accommodate from calibrating a gdistsamp using characters formulas fitstats() : Function from unmarked::parboot(), available with help(parboot). Allow estimation of multiple goodness-of-git statistic (Freeman-Tukey, Chi-squared and Sum of Squared Estimate of errors) through parametric bootstrap. In the manuscript, only chi-squared metric is used. E. dsmextra_modif_function.R R script called in 01_HDSfreq_Calibration.R. Miscellaneous adjustment of core function from dsmextra package (main change being the integration of tolerance argument (tol) in the chain of function. F. misc_unmarked.R R script called in 01_HDSfreq_Calibration.R. modify Setmethods for unmarked function, in particular for unmarked::parboot, allowing parallelization of parametric bootstrap with prior unmarked version (unmarked < 1.3.0). References Data was derived from the following sources: Burleigh, J.G., Kimball, R.T., Braun, E.L., 2015. Building the avian tree of life using a large-scale, sparse supermatrix. Molecular Phylogenetics and Evolution 84, 53–63. https://doi.org/10.1016/j.ympev.2014.12.003 Other sources : Julliard, R., Clavel, J., Devictor, V., Jiguet, F., Couvet, D., 2006. Spatial segregation of specialists and generalists in bird
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg (2019). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768

R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://doi.org/10.26180/5c844c7a81768

Dataset updated

Apr 1, 2019

Dataset provided by

Monash University

Authors

Gede Primahadi Wijaya Rajeg; Gede Primahadi Wijaya Rajeg

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

Publication

Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

Description of R codes and data files in the repository

This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Release. So, check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of top-200 infinitival collocates for will and be going to respectively across the twenty decades of Corpus of Historical American English (from the 1810s to the 2000s).

These two input files are used in the R code file 1-script-create-input-data-raw.r. The codes preprocess and combine the two files into a long format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); it is available in the input_data_raw.txt.

Then, the script 2-script-create-motion-chart-input-data.R processes the input_data_raw.txt for normalising the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.

Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.

Clear search

Close search

Google apps

Main menu

R codes and dataset for Visualisation of Diachronic Constructional Change...

Kickastarter Campaigns

Additional Tennessee Eastman Process Simulation Data for Anomaly Detection...

Replication Data for: The Wikipedia Adventure: Field Evaluation of an...

Data from: Russian Financial Statements Database: A firm-level collection of...

This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder

Alternatively, this will download ~540MB with all financial statements for 2023# to a Polars DataFrame (requires about 8GB of RAM)

Read RFSD metadata from local file

Use RFSD_dataset.schema to glimpse the data structure and columns' classes

Load full dataset into memory

Load only 2019 data into memory

Load only revenue for firms in 2019, identified by taxpayer id

Give suggested descriptive names to variables

Read RFSD metadata from local file

Use schema() to glimpse into the data structure and column classes

Load full dataset into memory

Load only 2019 data into memory

Load only revenue for firms in 2019, identified by taxpayer id

Give suggested descriptive names to variables

Tennessee Eastman Process Simulation Dataset

Intro

Content

Acknowledgements

User Agreement

FacialRecognition

head(d.train)

install.packages('foreach')

save(d.train, im.train, d.test, im.test, file='data.Rd')

load('data.Rd')

install.packages('reshape2')

Replication Data for Exploring an extinct society through the lens of...

EEA Air Quality In-Situ Measurement Station Data

Introduction

Accessing and pre-processing hourly data

Temporal aggregates

Data columns

Station meta data

Parquet file format

Market Basket Analysis

Market Basket Analysis

Introduction

An Example of Association Rules

Strategy

Dataset Description

Libraries in R

Data Pre-processing

All Lending Club loan data

Context

Background

Content

Update

Respiration_chambers/raw_log_files and combined datasets of biomass and...

Comprehensive Literary Greats Dataset

Comprehensive Literary Greats Dataset

50,000+ Books Rated and Awarded Across Language, Genre, and Format

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Data used in "A summer heatwave reduced activity, heart rate and autumn body...

Global map of tree density

Data Carpentry Genomics Curriculum Example Data

Reddit AskScience Flair Analysis Dataset

Context

Content

Ideas for Usage

Supplementary information for "Reassessment of French breeding bird...

R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart