26 datasets found
  1. Tomato Import Data | F And R Importing Company

    • seair.co.in
    Updated Feb 25, 2024
    Cite
    Seair Exim Solutions (2024). Tomato Import Data | F And R Importing Company [Dataset]. https://www.seair.co.in/us-import/product-tomato/i-f-and-r-importing-company.aspx
    Explore at:
    Available download formats: .text/.csv/.xml/.xls/.bin
    Dataset updated
    Feb 25, 2024
    Dataset authored and provided by
    Seair Exim Solutions
    Description

    Explore detailed Tomato import data of F And R Importing Company in the USA—product details, price, quantity, origin countries, and US ports.

  2. Data from: Importing General-Purpose Graphics in R

    • figshare.com
    • auckland.figshare.com
    application/gzip
    Updated Sep 19, 2018
    Cite
    Paul Murrell (2018). Importing General-Purpose Graphics in R [Dataset]. http://doi.org/10.17608/k6.auckland.7108736.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Sep 19, 2018
    Dataset provided by
    The University of Auckland
    Authors
    Paul Murrell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This report discusses some problems that can arise when attempting to import PostScript images into R, when the PostScript image contains coordinate transformations that skew the image. There is a description of some new features in the ‘grImport’ package for R that allow these sorts of images to be imported into R successfully.
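    For orientation, the basic grImport import pipeline that the report builds on looks roughly like this; the file names below are placeholders, and PostScriptTrace() requires Ghostscript to be available on the system:

    # a minimal sketch of the standard grImport workflow (file names are placeholders);
    # PostScriptTrace() converts PostScript to an RGML (XML) description via Ghostscript
    library(grImport)
    library(grid)

    PostScriptTrace("figure.ps", "figure.xml")  # trace the PostScript image
    pic <- readPicture("figure.xml")            # import the traced image into R
    grid.picture(pic)                           # draw it on a grid graphics device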

  3. Health and Retirement Study (HRS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r. the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.
    this new github repository contains five scripts:

    1992 - 2010 download HRS microdata.R
    - loop through every year and every file, download, then unzip everything in one big party

    import longitudinal RAND contributed files.R
    - create a SQLite database (.db) on the local disk
    - load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)

    longitudinal RAND - analysis examples.R
    - connect to the sql database created by the 'import longitudinal RAND contributed files' program
    - create two database-backed complex sample survey objects, using a taylor-series linearization design
    - perform a mountain of analysis examples with wave weights from two different points in the panel

    import example HRS file.R
    - load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html)
    - parse through the IF block at the bottom of the sas importation script, blank out a number of variables
    - save the file as an R data file (.rda) for fast loading later

    replicate 2002 regression.R
    - connect to the sql database created by the 'import longitudinal RAND contributed files' program
    - create a database-backed complex sample survey object, using a taylor-series linearization design
    - exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

    click here to view these five scripts. for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage, rand's hrs homepage, the hrs wikipedia page, and a running list of publications using hrs.

    notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
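    as a rough illustration of the 'database-backed complex sample survey object' mentioned above, a minimal sketch in r (not the repository's actual code; the database file, table name, and design variable names are stand-ins to check against the rand hrs documentation) might look like:

    # minimal sketch only: hrs.db, rand_hrs, and the design/weight variable names
    # below are illustrative placeholders, not taken from the actual download scripts
    library(survey)
    library(RSQLite)

    # database-backed complex sample survey object,
    # using a taylor-series linearization design
    hrs_design <-
      svydesign(
        id = ~raehsamp,      # sampling error computation unit (placeholder name)
        strata = ~raestrat,  # sampling stratum (placeholder name)
        weights = ~rwtresp,  # respondent-level wave weight (placeholder name)
        nest = TRUE,
        data = "rand_hrs",   # table stored inside the SQLite database
        dbtype = "SQLite",
        dbname = "hrs.db"
      )

    # example analysis: weighted mean of a hypothetical variable
    svymean(~some_variable, hrs_design, na.rm = TRUE)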

  4. Storage and Transit Time Data and Code

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 12, 2024
    + more versions
    Cite
    Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8136816
    Explore at:
    Dataset updated
    Jun 12, 2024
    Dataset provided by
    Montana State University
    Authors
    Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 5/5/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably in this project.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/annual/multi_year_average/average_annual_turnover.nc" contains a global array summarizing five year (2016-2020) averages of annual transit, storage, canopy transpiration, and number of months of data. This is the core dataset for the analysis; however, each folder has much more data, including a dataset for each year of the analysis. Data are also available is separate .csv files for each land cover type. Oterh data can be found for the minimum, monthly, and seasonal transit time found in their respective folders. These data were produced using the python code found in the "supporting_code" folder given the ease of working with .nc and EASE grid in the xarray python module. R was used primarily for data visualization purposes. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here.

    Code information

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a particular function:

    01_start.R: This script loads the R packages used in the analysis, sets the directory, and imports custom functions for the project. You can also load in the main transit time (turnover) datasets here using the source() function (a short usage sketch follows the script descriptions below).

    02_functions.R: This script contains the custom function for this analysis, primarily to work with importing the seasonal transit data. Load this using the source() function in the 01_start.R script.

    03_generate_data.R: This script is not necessary to run and is primarily for documentation. The main role of this code was to import and wrangle the data needed to calculate ground-based estimates of aboveground water storage.

    04_annual_turnover_storage_import.R: This script imports the annual turnover and storage data for each landcover type. You load in these data from the 01_start.R script using the source() function.

    05_minimum_turnover_storage_import.R: This script imports the minimum turnover and storage data for each landcover type. Minimum is defined as the lowest monthly estimate. You load in these data from the 01_start.R script using the source() function.

    06_figures_tables.R: This is the main workhorse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the manuscript_figures folder. Note that all maps were produced using Python code found in the "supporting_code" folder.
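    For orientation, a minimal sketch of loading these pieces interactively, assuming the working directory is the project root and using the terra package to read the core NetCDF file (the package choice is an assumption; the path comes from the data description above):

    # sketch under stated assumptions: terra is one of several packages that can
    # read .nc files; 01_start.R is described above as loading packages, setting
    # the directory, and sourcing the custom functions and import scripts
    library(terra)

    source("01_start.R")

    # core dataset: five-year (2016-2020) averages of annual transit, storage,
    # canopy transpiration, and number of months of data
    turnover <- rast("data/turnover_from_python/updated/annual/multi_year_average/average_annual_turnover.nc")
    turnover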

  5. Replication Data for: Lameness during the dry period: epidemiology and...

    • borealisdata.ca
    Updated Sep 9, 2019
    Cite
    Ruan R. Daros; Hanna K. Eriksson; Daniel M. Weary; Marina A. G. von Keyserlingk (2019). Replication Data for: Lameness during the dry period: epidemiology and associated factors [Dataset]. http://doi.org/10.5683/SP2/YTZMKX
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 9, 2019
    Dataset provided by
    Borealis
    Authors
    Ruan R. Daros; Hanna K. Eriksson; Daniel M. Weary; Marina A. G. von Keyserlingk
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Original data, R script (code), and code output for the paper published in the Journal of Dairy Science. For best use, replicate the analysis using R. Importing the data using the .csv file may cause some variables (columns of the spreadsheet) to be imported with the wrong format. If you run into any issues, do not hesitate to get in contact. Happy coding!
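    If you do import the .csv directly, a common safeguard against columns arriving in the wrong format is to check and re-type them explicitly after reading; a minimal sketch, with hypothetical file and column names:

    # minimal sketch: the file name and column names are hypothetical, so adjust
    # them to the spreadsheet layout described in the dataset documentation
    d <- read.csv("lameness_dry_period.csv", stringsAsFactors = FALSE)
    str(d)                                  # inspect which columns came in with the wrong class
    d$cow_id  <- as.character(d$cow_id)     # keep identifiers as text, not numbers
    d$parity  <- as.integer(d$parity)
    d$dry_off <- as.Date(d$dry_off, format = "%Y-%m-%d")
    d$lame    <- factor(d$lame, levels = c("sound", "lame"))
    str(d)                                  # confirm the corrected formats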

  6. Data from: Russian Financial Statements Database: A firm-level collection of...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    European University at St. Petersburg
    Authors
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

    The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

    You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

    from datasets import load_dataset
    import polars as pl

    # This will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder
    RFSD = load_dataset('irlspbru/RFSD')

    # Alternatively, this will download ~540MB with all financial statements for 2023
    # into a Polars DataFrame (requires about 8GB of RAM)
    RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

    Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.

    Local File Import

    Importing in Python requires the pyarrow package to be installed.

    import pyarrow.dataset as ds
    import polars as pl

    # Read RFSD metadata from a local file
    RFSD = ds.dataset("local/path/to/RFSD")

    # Use RFSD.schema to glimpse the data structure and column classes
    print(RFSD.schema)

    # Load the full dataset into memory
    RFSD_full = pl.from_arrow(RFSD.to_table())

    # Load only 2019 data into memory
    RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

    # Load only revenue for firms in 2019, identified by taxpayer id
    RFSD_2019_revenue = pl.from_arrow(
        RFSD.to_table(
            filter=ds.field('year') == 2019,
            columns=['inn', 'line_2110']
        )
    )

    # Give suggested descriptive names to variables
    renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
    RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

    R

    Local File Import

    Importing in R requires the arrow package to be installed.

    library(arrow)
    library(data.table)

    # Read RFSD metadata from a local file
    RFSD <- open_dataset("local/path/to/RFSD")

    # Use schema() to glimpse into the data structure and column classes
    schema(RFSD)

    # Load the full dataset into memory
    scanner <- Scanner$create(RFSD)
    RFSD_full <- as.data.table(scanner$ToTable())

    # Load only 2019 data into memory
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scanner <- scan_builder$Finish()
    RFSD_2019 <- as.data.table(scanner$ToTable())

    # Load only revenue for firms in 2019, identified by taxpayer id
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scan_builder$Project(cols = c("inn", "line_2110"))
    scanner <- scan_builder$Finish()
    RFSD_2019_revenue <- as.data.table(scanner$ToTable())

    # Give suggested descriptive names to variables
    renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
    setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

    Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

    We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

    Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings; Sibur filed only in 2023, and Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

    A firm may also have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.

    Why is the geolocation of firm X incorrect?

    We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

    Many firms submit correcting statements after the initial filing. Although we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

    We should stress that we provide unconsolidated financial statements filed according to Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups from these data. Gazprom, for instance, has over 800 affiliated entities; to study this corporate group in its entirety, it is not enough to consider the financials of the parent company alone.

    Why is the data not in CSV?

    The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

    Version and Update Policy

    Version (SemVer): 1.0.0.

    We intend to update the RFSD annually as the data become available, in other words when most firms have their statements filed with the Federal Tax Service. The official deadline for filing the previous year's statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. There is therefore a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

    Creative Commons License Attribution 4.0 International (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

    @unpublished{bondarkov2025rfsd,
      title  = {{R}ussian {F}inancial {S}tatements {D}atabase},
      author = {Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
      note   = {arXiv preprint arXiv:2501.05841},
      doi    = {https://doi.org/10.48550/arXiv.2501.05841},
      year   = {2025}
    }

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,

  7. Supplement 1. Example data and R code.

    • wiley.figshare.com
    html
    Updated Jun 4, 2023
    Cite
    Michael L. Collyer; Dean C. Adams (2023). Supplement 1. Example data and R code. [Dataset]. http://doi.org/10.6084/m9.figshare.3527483.v1
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Wiley
    Authors
    Michael L. Collyer; Dean C. Adams
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    File List
    collyer_adams_Rcode.txt -- R code for running analysis
    collyer_adams_example_data.csv -- example data to input into R routine
    collyer_adams_example_xmat.csv -- coding for the design matrix used
    collyer_adams_ESA_supplement.zip -- all files at once

    Description

    The collyer_adams_Rcode.txt file contains a procedure for performing the analysis described in Appendix A, using R. The procedure imports data and a design matrix (collyer_adams_example_data.csv and collyer_adams_example_xmat.csv are provided, and correspond to the example in Appendix A). The default number of permutations is 999, but can be changed. A matrix of random values (distances, contrasts, angles) and a results summary are created from the program. Users should be aware that importing different data sets will require altering some of the R code to accommodate their data (e.g., matrix dimensions would need to be changed).
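    A minimal sketch of pulling the supplied files into R before running the routine (whether the files carry header rows is an assumption, so check the comments in collyer_adams_Rcode.txt first):

    # minimal sketch: header settings are assumptions; verify against the files
    # and the instructions in collyer_adams_Rcode.txt
    dat  <- read.csv("collyer_adams_example_data.csv")             # example data for the routine
    xmat <- as.matrix(read.csv("collyer_adams_example_xmat.csv"))  # design matrix coding

    str(dat)
    dim(xmat)

    nperm <- 999  # default number of permutations; can be changed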

  8. Data from: Air Quality Data

    • kaggle.com
    zip
    Updated Nov 7, 2025
    Cite
    LauraMVC (2025). Air Quality Data [Dataset]. https://www.kaggle.com/datasets/lauramvc/air-quality-data
    Explore at:
    Available download formats: zip (529662 bytes)
    Dataset updated
    Nov 7, 2025
    Authors
    LauraMVC
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hi, absolute beginner here! This is my very first project on Kaggle. I chose this dataset simply to start exploring the platform and learn how to work with real-world data. My goal is to practice importing, cleaning, and visualizing sensor data using R, and to understand how Kaggle notebooks and workflows operate. This dataset caught my attention because it combines environmental data with an academic context, which I find both meaningful and technically interesting. I'm still learning, so any feedback or suggestions are welcome!

  9. Data files and R code

    • figshare.com
    txt
    Updated Aug 8, 2025
    Cite
    Thomas M.B. Kirkwood (2025). Data files and R code [Dataset]. http://doi.org/10.6084/m9.figshare.29861111.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 8, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Thomas M.B. Kirkwood
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data files and R code supporting "The terrestrial–aquatic transition impacts endocranial shape in caniform carnivorans" (Kirkwood et al., under review) are provided in this folder. The contents include: an R script for all analyses performed in this study; FCSV .txt files for importing landmark data; FCSV .txt files of templates; a template mesh (.obj); Excel metadata files; Excel files of paired landmarks for mirroring; a phylogenetic tree of Carnivora (Faurby et al., 2024); a mesh (.obj) for warping (Eira barbara); and an R script for resampling landmarks from Botton-Divet et al. (2016). The 3D endocranial meshes used in this study will be made available on request via Morphosource.
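    As an illustration of reading landmark coordinates from .fcsv files in R (a sketch only, assuming the common 3D Slicer markups layout in which header lines begin with '#' and the x, y, z coordinates sit in columns 2-4; the file name is a placeholder):

    # sketch only: assumes the common 3D Slicer .fcsv layout (comment header lines
    # starting with '#', coordinates in columns 2-4); the file name is a placeholder
    read_fcsv <- function(path) {
      lm <- read.csv(path, header = FALSE, comment.char = "#",
                     stringsAsFactors = FALSE)
      coords <- as.matrix(lm[, 2:4])   # x, y, z coordinates
      rownames(coords) <- lm[[1]]      # point IDs
      colnames(coords) <- c("x", "y", "z")
      coords
    }

    landmarks <- read_fcsv("endocast_landmarks.fcsv.txt")
    head(landmarks)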

  10. Case Study: Cyclist

    • kaggle.com
    zip
    Updated Jul 27, 2021
    Cite
    PatrickRCampbell (2021). Case Study: Cyclist [Dataset]. https://www.kaggle.com/patrickrcampbell/case-study-cyclist
    Explore at:
    Available download formats: zip (193057270 bytes)
    Dataset updated
    Jul 27, 2021
    Authors
    PatrickRCampbell
    Description

    Phase 1: ASK

    Key Objectives:

    1. Business Task * Cyclist is looking to increase their earnings, and wants to know if creating a social media campaign can influence "Casual" users to become "Annual" members.

    2. Key Stakeholders: * The main stakeholder from Cyclist is Lily Moreno, who is the Director of Marketing and responsible for the development of campaigns and initiatives to promote their bike-share program. The other teams involved with this project will be Marketing & Analytics, and the Executive Team.

    3. Business Task: * Comparing the two kinds of users and defining how they use the platform, what variables they have in common, what variables are different, and how they can get Casual users to become Annual members.

    Phase 2: PREPARE:

    Key Objectives:

    1. Determine Data Credibility * Cyclist provided data from years 2013-2021 (through March 2021), all of which is first-hand data collected by the company.

    2. Sort & Filter Data: * The stakeholders want to know how the current users are using their service, so I am focusing on using the data from 2020-2021 since this is the most relevant period of time to answer the business task.

    #Installing packages
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    install.packages("readr", repos = "http://cran.us.r-project.org")
    install.packages("janitor", repos = "http://cran.us.r-project.org")
    install.packages("geosphere", repos = "http://cran.us.r-project.org")
    install.packages("gridExtra", repos = "http://cran.us.r-project.org")
    
    library(tidyverse)
    library(lubridate) #provides wday() used later for the weekday calculations
    library(readr)
    library(janitor)
    library(geosphere)
    library(gridExtra)
    
    #Importing data & verifying the information within the dataset
    all_tripdata_clean <- read.csv("/Data Projects/cyclist/cyclist_data_cleaned.csv")
    
    glimpse(all_tripdata_clean)
    
    summary(all_tripdata_clean)
    
    

    Phase 3: PROCESS

    Key Objectives:

    1. Cleaning Data & Preparing for Analysis: * Once the data had been placed into one dataset and checked for errors, we began cleaning the data. * Eliminating data that correlates to the company servicing the bikes, and any ride with a traveled distance of zero. * New columns will be added to assist in the analysis, and to provide accurate assessments of who is using the bikes.

    #Eliminating any data that represents the company performing maintenance, and trips without any measurable distance
    all_tripdata_clean <- all_tripdata_clean[!(all_tripdata_clean$start_station_name == "HQ QR" | all_tripdata_clean$ride_length<0),] 
    
    #Creating columns for the individual date components (day_of_week is created last, after the date column exists)
    all_tripdata_clean$date <- as.Date(all_tripdata_clean$started_at)
    all_tripdata_clean$day <- format(as.Date(all_tripdata_clean$date), "%d")
    all_tripdata_clean$month <- format(as.Date(all_tripdata_clean$date), "%m")
    all_tripdata_clean$year <- format(as.Date(all_tripdata_clean$date), "%Y")
    all_tripdata_clean$day_of_week <- format(as.Date(all_tripdata_clean$date), "%A")
    
    

    Now I will begin calculating the length of rides being taken, distance traveled, and the mean amount of time & distance.

    #Calculating the ride length in miles & minutes
    all_tripdata_clean$ride_length <- difftime(all_tripdata_clean$ended_at,all_tripdata_clean$started_at,units = "mins")
    
    all_tripdata_clean$ride_distance <- distGeo(matrix(c(all_tripdata_clean$start_lng, all_tripdata_clean$start_lat), ncol = 2), matrix(c(all_tripdata_clean$end_lng, all_tripdata_clean$end_lat), ncol = 2))
    all_tripdata_clean$ride_distance = all_tripdata_clean$ride_distance/1609.34 #converting to miles
    
    #Calculating the mean time and distance based on the user groups
    userType_means <- all_tripdata_clean %>% group_by(member_casual) %>% summarise(mean_time = mean(ride_length))
    
    
    userType_means <- all_tripdata_clean %>% 
     group_by(member_casual) %>% 
     summarise(mean_time = mean(ride_length),mean_distance = mean(ride_distance))
    

    Adding in calculations that will differentiate between bike types and which type of user is using each specific bike type.

    #Calculations
    
    with_bike_type <- all_tripdata_clean %>% filter(rideable_type=="classic_bike" | rideable_type=="electric_bike")
    
    #Ride totals by user type, bike type, and weekday
    with_bike_type %>%
     mutate(weekday = wday(started_at, label = TRUE)) %>% 
     group_by(member_casual,rideable_type,weekday) %>%
     summarise(totals=n(), .groups="drop")
    
    #Ride totals by user type and bike type
    with_bike_type %>%
     group_by(member_casual,rideable_type) %>%
     summarise(totals=n(), .groups="drop")
    
    #Calculating the ride differential
    all_tripdata_clean %>% 
     mutate(weekday = wday(started_at, label = TRUE)) %>% 
     group_by(member_casual, weekday) %>% 
     summarise(number_of_rides = n()
          ,average_duration = mean(ride_length),.groups = 'drop') %>% 
     arrange(me...
    
  11. Data and code from: Severity of charcoal rot disease in soybean genotypes...

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    Updated May 8, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Data and code from: Severity of charcoal rot disease in soybean genotypes inoculated with Macrophomina phaseolina isolates differs among growth environments [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-severity-of-charcoal-rot-disease-in-soybean-genotypes-inoculated-with-i
    Explore at:
    Dataset updated
    May 8, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset includes all the raw data and all the R statistical software code that we used to analyze the data and produce all the outputs that are in the figures, tables, and text of the associated manuscript: Mengistu, A., Q. D. Read, C. R. Little, H. M. Kelly, P. M. Henry, and N. Bellaloui. 2025. Severity of charcoal rot disease in soybean genotypes inoculated with Macrophomina phaseolina isolates differs among growth environments. Plant Disease. DOI: 10.1094/PDIS-10-24-2230-RE.

    The data included here come from a series of tests designed to evaluate methods for identifying soybean genotypes that are resistant or susceptible to charcoal rot, a widespread and economically significant disease. Four independent experiments were performed to determine the variability in disease severity by soybean genotype and by isolated variant of the charcoal rot fungus: two field tests, a greenhouse test, and a growth chamber test. The tests differed in the number of genotypes and isolates used, as well as the method of inoculation. The accuracy of identifying resistant and susceptible genotypes varied by study, and the same isolate tested across different studies often had highly variable disease severity. Our results indicate that the non-field methods are not reliable ways to identify sources of charcoal rot resistance in soybean.

    The models fit in the R script archived here are Bayesian general linear mixed models with AUDPC (area under the disease progress curve) as the response variable. One-dimensional clustering is used to divide the genotypes into resistant and susceptible based on their model-predicted AUDPC values, and this result is compared with the preexisting resistance classification. Posterior distributions of the marginal means for different combinations of genotype, isolate, and other covariates are estimated and compared. Code to reproduce the tables and figures of the manuscript is also included.

    The following files are included:

    README.pdf: Full description, with column metadata for the data spreadsheets and text description of each R script

    data2023-04-18.xlsx: Excel sheet with data from three of the four trials

    cleaned_data.RData: all data in analysis-ready format; generates a set of data frames when imported into an R environment

    Modified Cut-Tip Inoculation on DT974290 and LS980358 on first 32 isolates.xlsx: Excel spreadsheet with data from the fourth trial

    data_cleaning.R: Script required to format data from .xlsx files into analysis-ready format (running this script is not necessary to reproduce the analysis; instead you may begin with the following script, importing the cleaned .RData object)

    AUDPC_fits.R: Script containing code for all model fitting, model predictions and comparisons, and figure and table generation
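    The modeling approach described above (Bayesian general linear mixed models with AUDPC as the response, followed by comparison of posterior marginal means) can be sketched roughly as follows in R; this is an illustration only, not the archived AUDPC_fits.R script, and the package choice (brms) and column names are assumptions:

    # illustrative sketch only, not the archived AUDPC_fits.R: the brms package
    # and the column names (audpc, genotype, isolate, trial, rep) are assumptions
    library(brms)
    library(emmeans)

    fit <- brm(
      audpc ~ genotype * isolate + (1 | trial/rep),  # fixed effects plus nested random effects
      data = charcoal_rot,                           # hypothetical analysis-ready data frame
      family = gaussian(),
      chains = 4, cores = 4, seed = 1
    )

    summary(fit)

    # posterior marginal means for genotype within each isolate
    emmeans(fit, ~ genotype | isolate)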

  12. IRIS data set for Beginners

    • kaggle.com
    zip
    Updated Jul 11, 2018
    Cite
    Sanjeet Kumar Yadav (2018). IRIS data set for Beginners [Dataset]. https://www.kaggle.com/datasets/sanjeet41/iris-data-set-for-beginners/data
    Explore at:
    Available download formats: zip (1291 bytes)
    Dataset updated
    Jul 11, 2018
    Authors
    Sanjeet Kumar Yadav
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    The iris data set is a built-in data set in R (and so available in RStudio), on which many people can practice operations such as exporting and importing data, viewing the data, inspecting the structure of the data set, listing column names, checking the iris types, and trying different visualization techniques. There's a story behind every data set and here is an opportunity to share it with you.
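    A few of the basic operations mentioned above, in base R (the CSV path is a placeholder):

    # iris ships with R, so no import is needed to get started
    data(iris)

    head(iris)            # view the first rows
    str(iris)             # structure: 150 obs. of 5 variables
    names(iris)           # column names
    levels(iris$Species)  # the three iris types

    # exporting and re-importing as CSV (path is a placeholder)
    write.csv(iris, "iris_export.csv", row.names = FALSE)
    iris2 <- read.csv("iris_export.csv", stringsAsFactors = TRUE)

    # a simple visualization
    plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species,
         xlab = "Sepal length", ylab = "Petal length")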

    Content

    This data set has 150 rows and 5 columns.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  13. Data from: Composition of Foods Raw, Processed, Prepared USDA National...

    • agdatacommons.nal.usda.gov
    • datasetcatalog.nlm.nih.gov
    • +4more
    pdf
    Updated Nov 21, 2025
    + more versions
    Cite
    David B. Haytowitz; Jaspreet K.C. Ahuja; Bethany Showell; Meena Somanchi; Melissa Nickle; Quynh Anh Nguyen; Juhi R. Williams; Janet M. Roseland; Mona Khan; Kristine Y. Patterson; Jacob Exler; Shirley Wasswa-Kintu; Robin Thomas; Pamela R. Pehrsson (2025). Composition of Foods Raw, Processed, Prepared USDA National Nutrient Database for Standard Reference, Release 28 [Dataset]. http://doi.org/10.15482/USDA.ADC/1324304
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Authors
    David B. Haytowitz; Jaspreet K.C. Ahuja; Bethany Showell; Meena Somanchi; Melissa Nickle; Quynh Anh Nguyen; Juhi R. Williams; Janet M. Roseland; Mona Khan; Kristine Y. Patterson; Jacob Exler; Shirley Wasswa-Kintu; Robin Thomas; Pamela R. Pehrsson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    [Note: Integrated as part of FoodData Central, April 2019.] The database consists of several sets of data: food descriptions, nutrients, weights and measures, footnotes, and sources of data. The Nutrient Data file contains mean nutrient values per 100 g of the edible portion of food, along with fields to further describe the mean value. Information is provided on household measures for food items. Weights are given for edible material without refuse. Footnotes are provided for a few items where information about food description, weights and measures, or nutrient values could not be accommodated in existing fields. Data have been compiled from published and unpublished sources. Published data sources include the scientific literature. Unpublished data include those obtained from the food industry, other government agencies, and research conducted under contracts initiated by USDA's Agricultural Research Service (ARS). Updated data have been published electronically on the USDA Nutrient Data Laboratory (NDL) web site since 1992. Standard Reference (SR) 28 includes composition data for all the food groups and nutrients published in the 21 volumes of "Agriculture Handbook 8" (US Department of Agriculture 1976-92), and its four supplements (US Department of Agriculture 1990-93), which superseded the 1963 edition (Watt and Merrill, 1963). SR28 supersedes all previous releases, including the printed versions, in the event of any differences.

    Attribution for photos: Photo 1: k7246-9 Copyright free, public domain photo by Scott Bauer. Photo 2: k8234-2 Copyright free, public domain photo by Scott Bauer.

    Resources in this dataset:

    Resource Title: READ ME - Documentation and User Guide - Composition of Foods Raw, Processed, Prepared - USDA National Nutrient Database for Standard Reference, Release 28. File Name: sr28_doc.pdf. Resource Software Recommended: Adobe Acrobat Reader, url: http://www.adobe.com/prodindex/acrobat/readstep.html

    Resource Title: ASCII (6.0Mb; ISO/IEC 8859-1). File Name: sr28asc.zip. Resource Description: Delimited file suitable for importing into many programs. The tables are organized in a relational format, and can be used with a relational database management system (RDBMS), which will allow you to form your own queries and generate custom reports.

    Resource Title: ACCESS (25.2Mb). File Name: sr28db.zip. Resource Description: This file contains the SR28 data imported into a Microsoft Access (2007 or later) database. It includes relationships between files and a few sample queries and reports.

    Resource Title: ASCII (Abbreviated; 1.1Mb; ISO/IEC 8859-1). File Name: sr28abbr.zip. Resource Description: Delimited file suitable for importing into many programs. This file contains data for all food items in SR28, but not all nutrient values--starch, fluoride, betaine, vitamin D2 and D3, added vitamin E, added vitamin B12, alcohol, caffeine, theobromine, phytosterols, individual amino acids, individual fatty acids, or individual sugars are not included. These data are presented per 100 grams, edible portion. Up to two household measures are also provided, allowing the user to calculate the values per household measure, if desired.

    Resource Title: Excel (Abbreviated; 2.9Mb). File Name: sr28abxl.zip. Resource Description: For use with Microsoft Excel (2007 or later), but can also be used by many other spreadsheet programs. This file contains data for all food items in SR28, but not all nutrient values--starch, fluoride, betaine, vitamin D2 and D3, added vitamin E, added vitamin B12, alcohol, caffeine, theobromine, phytosterols, individual amino acids, individual fatty acids, or individual sugars are not included. These data are presented per 100 grams, edible portion. Up to two household measures are also provided, allowing the user to calculate the values per household measure, if desired. Resource Software Recommended: Microsoft Excel, url: https://www.microsoft.com/

    Resource Title: ASCII (Update Files; 1.1Mb; ISO/IEC 8859-1). File Name: sr28upd.zip. Resource Description: Update Files - Contains updates for those users who have loaded Release 27 into their own programs and wish to do their own updates. These files contain the updates between SR27 and SR28. Delimited file suitable for import into many programs.
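    As an example of pulling one of the delimited ASCII tables into R: the SR28 ASCII files are commonly described as caret-delimited with text fields wrapped in tildes, so a sketch along the following lines may work (the file path and the column names shown are assumptions to verify against sr28_doc.pdf):

    # sketch under stated assumptions: caret-delimited fields, tilde-quoted text,
    # ISO/IEC 8859-1 encoding; verify the layout and field names in sr28_doc.pdf
    food_des <- read.table(
      "sr28asc/FOOD_DES.txt",
      sep = "^", quote = "~",
      fileEncoding = "ISO-8859-1",
      stringsAsFactors = FALSE
    )
    # assign the first few column names from the documentation (assumed here)
    names(food_des)[1:4] <- c("NDB_No", "FdGrp_Cd", "Long_Desc", "Shrt_Desc")
    head(food_des[, 1:4])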

  14. Data from: Negotiating mutualism: a locus for exploitation by rhizobia has a...

    • datadryad.org
    • data.niaid.nih.gov
    • +2more
    zip
    Updated Apr 22, 2022
    Cite
    Camille Wendlandt; Miles Roberts; Kyle Nguyen; Marion Graham; Zoie Lopez; Emily Helliwell; Maren Friesen; Joel Griffitts; Paul Price; Stephanie Porter (2022). Negotiating mutualism: a locus for exploitation by rhizobia has a broad effect size distribution and context-dependent effects on legume hosts [Dataset]. http://doi.org/10.5061/dryad.bk3j9kdf9
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Dryad
    Authors
    Camille Wendlandt; Miles Roberts; Kyle Nguyen; Marion Graham; Zoie Lopez; Emily Helliwell; Maren Friesen; Joel Griffitts; Paul Price; Stephanie Porter
    Time period covered
    Apr 8, 2022
    Description

    We provide four datasets as csv files and one R code file that contains code for analyzing the data in these files. See the "README" file for metadata for each dataset. The contents of each file are as follows:
    * 2018_knockout_CFU_data.csv contains nodule culturing data (counts of colony forming units) from the 2018 Knockout Experiment
    * 2018_knockout_greenhouse_data.csv contains plant harvest data from the 2018 Knockout Experiment
    * 2019_GxG_knockout_CFU_data_330plants.csv contains nodule culturing data (counts of colony forming units) from the 2019 GxG Knockout Experiment
    * 2019_GxG_knockout_greenhouse_data_330plants.csv contains plant harvest data from the 2019 GxG Knockout Experiment
    * Wendlandt_et_al_2022_JEvolBiol_code.R contains R code for importing, processing, and analyzing data from the above four datasets. It also contains code for producing figures from these data.

  15. Supplement 1. R code for running PnET-CN simulations for AmeriFlux sites, as...

    • figshare.com
    • wiley.figshare.com
    html
    Updated Jun 1, 2023
    Cite
    Alexandra M. Thorn; Jingfeng Xiao; Scott V. Ollinger (2023). Supplement 1. R code for running PnET-CN simulations for AmeriFlux sites, as in text. [Dataset]. http://doi.org/10.6084/m9.figshare.3563955.v1
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley
    Authors
    Alexandra M. Thorn; Jingfeng Xiao; Scott V. Ollinger
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    File List
        README (MD5: c3c7b6e8b8c2e42f967dff8c15b6d9e6)
        SiteAndVegParams.RData (MD5: 86dcf70da2c1e4766cde4a222b78f80b)
        RunSimulations2014-06-11.R (MD5: a7ad23be70e82619e104dbe42746820b)
        AmerifluxToClimateFile.R (MD5: 093dfe9d2ff6e61fd2e1b92615d60a70)
        pnetcn_NewPhenology.R (MD5: 7e069c490ec0321a0572c7f73943e96a)
        pnetcn_NewPhenology_cont.R (MD5: 44d4689a457b964f48e497f9b4d3d0df)
        pnetcn.R (MD5: d4168f06879c698f46b12d4dfb748d74)
        spinup_pnetcn_NewPhenology.R (MD5: a64a22fd94e06389292c8c417d2d75d8)
        spinup_pnetcn.R (MD5: 1dc1a10f75c4e14e8108919e0a2d31a3)
        run_pnetcn.R (MD5: 204dcd9176dfe57b83474fc4677253d0)
        AllocateMo.R (MD5: 7bd2c8e7c2573b64fddef23878be7be6)
        AllocateYr_NewPhenology.R (MD5: 68c4686f579ae19453fbdd73fdaca927)
        AllocateYr.R (MD5: 17e49c1aaf9d3d045af11344683eaa45)
        AtmEnviron_NewPhenologyOLD.R (MD5: f9f9ce07b2d1d29b719f58c0c2bf5eec)
        AtmEnviron_NewPhenology.R (MD5: 6d11dce44856009c05af514b8960cbf5)
        AtmEnviron.R (MD5: 6a352d22899c3fe1c10cecd2f2bae41d)
        CalculateYr_NewPhenology.R (MD5: b0d9d700ba3c9503c8d280a6bd50b0f4)
        CNTrans.R (MD5: 15f7ea8951957fe0b5714d3164bcb739)
        Decomp.R (MD5: 6f851eea37ba654a2434a3a9df4f7160)
        GrowthInit_NewPhenology.R (MD5: ce50ce5fa8a8282e7ad38ec913c9f948)
        initvars.R (MD5: 8503263de5d593fde793622256556209)
        Leach.R (MD5: 49667570a1eae670cb4121f7ca9bfd64)
        PhenologyNew.R (MD5: 088b2c1b406fab16c4bc55c27c85a5c6)
        PhenologyNew2.R (MD5: 6d881331616e178c4b79d0bda723a934)
        Phenology.R (MD5: 07f72b99a8e981d76100548247d948f2)
        Photosyn.R (MD5: 44755415a892bedc86b0f6bad0f50853)
        SoilResp.R (MD5: 018dc6e0d4eb5a5f66bb46179099f74b)
        storeoutput.R (MD5: f527f8a285c169bdb53aa518f15880c5)
        StoreYrOutput.R (MD5: 984cbf2a2f518b5795c319e48bac1ac4)
        Waterbal.R (MD5: c3d9d4516e009b9fad7d08ac488f2abc)
        YearInit_NewPhenology.R (MD5: 0e9957a828f64f25495bdca9c409d070)
        YearInit.R (MD5: ceb4bab739a52b78cfcf7ca411939174)
        Ameriflux/
        Daymet/
        FluxDataFunctions.R (MD5: ae89d6b9cf4e01ba501941ce12690e4b)

      Description
        README - Brief notes to get started
        SiteAndVegParams.RData – R data structures containing site-specific parameters for the six sites analyzed in the paper
        RunSimulations2014-06-11.R – Template code for importing AmeriFlux and Daymet data and running simulations with climate files including spinup years (must be modified)
        AmerifluxToClimateFile.R – Functions needed to import Ameriflux and Daymet data and generate climate files
        pnetcn_NewPhenology.R - Top level function to run the version of PnET-CN used in the paper (new phenology routine)
        pnetcn.R - Top level function to run traditional version of PnET-CN (old phenology routine)
        spinup_pnetcn_NewPhenology.R – Alternative version of top-level function that repeats the climate data from the input climate file an arbitrary number of times for spinup (new phenology routine)
        pnetcn_NewPhenology_cont.R – Function to continue a simulation starting with spinup data from a previous run (called by spinup_pnetcn_NewPhenology.R)
        spinup_pnetcn.R - Alternative version of top-level function for traditional version of PnET-CN that repeats the climate data from the input climate file an arbitrary number of times for spinup (old phenology routine)
        run_pnetcn.R - Functions to run PnET-CN with output as data frames of monthly or annual data instead of as a list containing both formats
        AllocateMo.R - Monthly allocation routine for PnET-CN
        AllocateYr_NewPhenology.R - Yearly allocation routine for PnET-CN (with new phenology routine)
        AllocateYr.R - Yearly allocation routine for PnET-CN (with old phenology routine)
        AtmEnviron_NewPhenology.R - Environmental calculations for PnET-CN (for new phenology routine)
        AtmEnviron_NewPhenologyOLD.R – Older version of environmental calculations for PnET-CN, with less data saved to data structure (for new phenology routine)
        AtmEnviron.R - Environmental calculations for PnET-CN (for old phenology routine)
        CalculateYr_NewPhenology.R - Calculate yearly output values for PnET-CN (for new phenology routine)
        CNTrans.R - Carbon and nitrogen translocation routine for PnET-CN
        Decomp.R - Decomposition routine for PnET-CN
        GrowthInit_NewPhenology.R – Initialize annual aggregation variables for each year in PnET-CN
        initvars.R - Initialize internal shared variable structures for PnET-CN
        Leach.R - Leaching routine for PnET-CN
        PhenologyNew.R - Functions to calculate phenology for PnET-CN (new phenology routine)
        PhenologyNew2.R - Skeleton code for new functions to calculate phenology for PnET-CN that would use alternative (e.g., water-driven) phenology cues for grasslands (new phenology routine - INCOMPLETE)
        Phenology.R - Functions to calculate phenology for PnET-CN (old phenology routine)
        Photosyn.R - Photosynthesis routine for PnET-CN
        SoilResp.R - Soil respiration routine for PnET-CN
        storeoutput.R - Adds variable values to the returned output structure so that the user may work with them (or save them) at the command line after running PnET-CN
        StoreYrOutput.R - Routine to save annual results to an output file for PnET-CN (not used)
        Waterbal.R - Ecosystem water balance routine for PnET-CN
        ...
    
  16. Copernicus Digital Elevation Model (DEM) for Europe at 100 meter resolution...

    • data.opendatascience.eu
    • data.mundialis.de
    • +4more
    Updated Feb 23, 2022
    + more versions
    Cite
    (2022). Copernicus Digital Elevation Model (DEM) for Europe at 100 meter resolution (EU-LAEA) derived from Copernicus Global 30 meter DEM dataset [Dataset]. https://data.opendatascience.eu/geonetwork/srv/search?keyword=DSM
    Explore at:
    Dataset updated
    Feb 23, 2022
    Description

    The Copernicus DEM is a Digital Surface Model (DSM) which represents the surface of the Earth including buildings, infrastructure and vegetation. The original GLO-30 provides worldwide coverage at 30 meters (corresponding to 1 arc second). Note that ocean areas do not have tiles; there one can assume height values equal to zero. Data is provided as Cloud Optimized GeoTIFFs. Note that the vertical unit for measurement of elevation height is meters.

    The Copernicus DEM for Europe at 100 meter resolution (EU-LAEA projection) in COG format has been derived from the Copernicus DEM GLO-30, mirrored on Open Data on AWS, a dataset managed by Sinergise (https://registry.opendata.aws/copernicus-dem/).

    Processing steps: The original Copernicus GLO-30 DEM contains a relevant percentage of tiles with non-square pixels. We created a mosaic map in VRT format (https://gdal.org/drivers/raster/vrt.html) and defined within the VRT file the rule to apply cubic resampling while reading the data, i.e. importing them into GRASS GIS for further processing. We chose cubic instead of bilinear resampling since the height-width ratio of non-square pixels is up to 1:5; hence, artefacts between adjacent tiles in rugged terrain could be minimized:

    gdalbuildvrt -input_file_list list_geotiffs_MOOD.csv -r cubic -tr 0.000277777777777778 0.000277777777777778 Copernicus_DSM_30m_MOOD.vrt

    In order to reproject the data to the EU-LAEA projection while reducing the spatial resolution to 100 m, bilinear resampling was performed in GRASS GIS (using r.proj) and the pixel values were scaled by 1000 (storing the pixels as integer values) for data volume reduction. In addition, a hillshade raster map was derived from the resampled elevation map (using r.relief in GRASS GIS). Eventually, we exported the elevation and hillshade raster maps in Cloud Optimized GeoTIFF (COG) format, along with SLD and QML style files.
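    For readers working in R rather than GRASS GIS, a roughly equivalent sketch of the reprojection and hillshading steps with the terra package (this is a substitute illustration, not the original processing chain described above; file paths are placeholders):

    # a rough R/terra analogue of the reproject + hillshade steps described above;
    # NOT the original GRASS GIS workflow, and the file paths are placeholders
    library(terra)

    dem <- rast("Copernicus_DSM_30m_MOOD.vrt")

    # reproject to EU-LAEA (EPSG:3035) at 100 m with bilinear resampling
    dem_laea <- project(dem, "EPSG:3035", method = "bilinear", res = 100)

    # derive a hillshade from slope and aspect (in radians)
    slope  <- terrain(dem_laea, v = "slope",  unit = "radians")
    aspect <- terrain(dem_laea, v = "aspect", unit = "radians")
    hs     <- shade(slope, aspect, angle = 45, direction = 315)

    # write both layers as Cloud Optimized GeoTIFFs
    writeRaster(dem_laea, "dem_eu_laea_100m.tif", filetype = "COG", overwrite = TRUE)
    writeRaster(hs, "hillshade_eu_laea_100m.tif", filetype = "COG", overwrite = TRUE)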

  17. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Explore at:
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.
    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.

    Getting Started

    This text provides the information on the LSC (Leicester Scientific Corpus) and the pre-processing steps on abstracts, and describes the structure of files to organise the corpus. This corpus is created to be used in future work on the quantification of the meaning of research texts and to make it available for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceeding papers published in 2014, and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
    1. Authors: The list of authors of the paper
    2. Title: The title of the paper
    3. Abstract: The abstract of the paper
    4. Categories: One or more category from the list of categories [2]. The full list of categories is presented in the file 'List_of_Categories.txt'.
    5. Research Areas: One or more research area from the list of research areas [3]. The full list of research areas is presented in the file 'List_of_Research_Areas.txt'.
    6. Total Times cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

    The corpus was collected in July 2018 online and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading of the Data Online

    The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.

    Step 2: Importing the Dataset into R
    The LSC was collected as TXT files, and all documents were imported into R.
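
    The description gives no code for this step; a minimal sketch, assuming the WoS exports sit in a local folder as tab-delimited .txt files (the folder name, encoding and reader settings are assumptions, not part of the original workflow):

    ```
    # Minimal sketch: read a folder of tab-delimited WoS export files into one
    # data frame. Folder name and file encoding are assumptions.
    files <- list.files("wos_exports", pattern = "\\.txt$", full.names = TRUE)

    read_wos <- function(f) {
      read.delim(f, quote = "", stringsAsFactors = FALSE,
                 fileEncoding = "UTF-16LE", check.names = FALSE)
    }

    lsc <- do.call(rbind, lapply(files, read_wos))
    ```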

    Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category
    As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and all documents without categories were removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts
    Medicine-related publications in particular use 'structured abstracts'. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates section headings with the first word of the following section; for instance, we observe words such as ConclusionHigher and ConclusionsRT. Such words were detected and identified by sampling medicine-related publications with human intervention, and each detected concatenated word was split into two words; for instance, 'ConclusionHigher' is split into 'Conclusion' and 'Higher'. The section headings appearing in such abstracts are listed below:

    Background; Method(s); Design; Theoretical; Measurement(s); Location; Aim(s); Methodology; Process; Abstract; Population; Approach; Objective(s); Purpose(s); Subject(s); Introduction; Implication(s); Patient(s); Procedure(s); Hypothesis; Measure(s); Setting(s); Limitation(s); Discussion; Conclusion(s); Result(s); Finding(s); Material(s); Rationale(s); Implications for health and nursing policy

    Step 5: Extracting (Sub-setting) the Data Based on the Lengths of Abstracts
    After correction, the lengths of abstracts were calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as the Microsoft Word 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
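
    The description gives no code for Steps 4 and 5; a minimal R sketch under stated assumptions (a data frame `lsc` with a character column `Abstract`; the heading vocabulary is abbreviated here), not the authors' original implementation:

    ```
    # Minimal sketch of Steps 4-5, not the authors' original code. Assumes a data
    # frame `lsc` with a character column `Abstract`; heading list is abbreviated.
    headings <- c("Background", "Methods?", "Conclusions?", "Results?", "Findings?")

    # Step 4: split a heading glued to the following word, e.g. "ConclusionHigher".
    split_headings <- function(text) {
      pattern <- paste0("\\b(", paste(headings, collapse = "|"), ")(?=[A-Z])")
      gsub(pattern, "\\1 ", text, perl = TRUE)
    }
    lsc$Abstract <- split_headings(lsc$Abstract)

    # Step 5: keep abstracts with 30-500 words (word = whitespace-separated token).
    wc  <- lengths(strsplit(trimws(lsc$Abstract), "\\s+"))
    lsc <- lsc[wc >= 30 & wc <= 500, ]
    ```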

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1
    Publications can include a footer below the abstract text containing a copyright notice, permission policy, journal name, licence, authors' rights or conference name, added by the journal or conference. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text; for example, copyright notices such as 'Published by Elsevier Ltd.' appear in many texts. To avoid abnormal word frequencies in further analysis (e.g. bias in frequency calculations), we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1, removing copyright notices, conference names, journal names, authors' rights, licences and permission policies identified by sampling abstracts.

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on the Lengths of Abstracts
    The cleaning procedure described in the previous step left some abstracts below our minimum length criterion (30 words); 474 texts were removed.

    Step 8: Saving the Dataset in CSV Format
    Documents are saved into 34 CSV files. In these files, the information is organised with one record per line, and the abstract, title, list of authors, list of categories, list of research areas, and citation counts are recorded in separate fields. (A minimal R sketch of Steps 6 and 8 follows the reference list below.)

    To access the LSC for research purposes, please email ns433@le.ac.uk.

    References
    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] American Psychological Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
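
    Not the authors' original code: a minimal R sketch of the footer cleaning (Step 6) and CSV export (Step 8), with purely illustrative footer phrases and the same assumed `lsc` data frame as above:

    ```
    # Minimal sketch of Steps 6 and 8. The footer phrases below are illustrative
    # only; the real list was built by sampling abstracts.
    footer_patterns <- c(
      "Published by Elsevier Ltd\\.?",
      "\\(C\\) \\d{4} .*? All rights reserved\\.?"
    )
    strip_footers <- function(text) {
      for (p in footer_patterns) {
        text <- gsub(p, "", text, ignore.case = TRUE, perl = TRUE)
      }
      trimws(text)
    }
    lsc$Abstract <- strip_footers(lsc$Abstract)

    # Step 8: write the corpus out in 34 roughly equal-sized CSV files.
    chunks <- split(lsc, cut(seq_len(nrow(lsc)), 34, labels = FALSE))
    for (i in seq_along(chunks)) {
      write.csv(chunks[[i]], sprintf("LSC_part_%02d.csv", i), row.names = FALSE)
    }
    ```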

  18. RUNNING"calorie:heartrate

    • kaggle.com
    zip
    Updated Jan 6, 2022
    Cite
    romechris34 (2022). RUNNING"calorie:heartrate [Dataset]. https://www.kaggle.com/datasets/romechris34/wellness
    Explore at:
    zip(25272804 bytes)Available download formats
    Dataset updated
    Jan 6, 2022
    Authors
    romechris34
    Description

    ---
    title: "BellaBeat Fitbit"
    author: "C Romero"
    date: "`r Sys.Date()`"
    output:
      html_document:
        number_sections: true
        toc: true
    ---

    ```{r}
    ## base ships with R and does not need to be installed

    ## Install ggplot2 for data visualisation
    install.packages("ggplot2")

    ## Install lubridate, which makes it easier to work with dates and times
    install.packages("lubridate")

    ## Install the tidyverse metapackage
    install.packages("tidyverse")

    ## Install dplyr for data manipulation
    install.packages("dplyr")

    ## Install readr for reading rectangular text data
    install.packages("readr")

    ## Install tidyr for tidying data
    install.packages("tidyr")
    ```

    Importing packages

    ```{r}
    library(tidyverse) # metapackage of all tidyverse packages
    library(lubridate) # makes dealing with dates a little easier
    library(ggplot2)   # create elegant data visualisations using the grammar of graphics
    library(dplyr)     # a grammar of data manipulation
    library(readr)     # read rectangular text data
    library(tidyr)     # tidy data
    ```

    
    ## Running code
    
    In a notebook, you can run a single code cell by clicking in the cell and then hitting 
    the blue arrow to the left, or by clicking in the cell and pressing Shift+Enter. In a script, 
    you can run code by highlighting the code you want to run and then clicking the blue arrow
    at the bottom of this window.
    
    ## Reading in files
    
    
    ```{r}
    # list the files shipped with the dataset
    list.files(path = "../input")
    ```

    ```{r}
    # load the activity and sleep data sets
    dailyActivity <- read_csv("../input/wellness/dailyActivity_merge.csv")
    sleepDay <- read_csv("../input/wellness/sleepDay_merged.csv")
    ```

    check for duplicates and na

    ```{r}
    sum(duplicated(dailyActivity))
    sum(duplicated(sleepDay))
    sum(is.na(dailyActivity))
    sum(is.na(sleepDay))
    ```

    now we will remove duplicates from sleep & create a new dataframe

    ```{r}
    sleepy <- sleepDay %>% distinct()
    head(sleepy)
    head(dailyActivity)
    ```

    count the number of distinct ids in the sleepy & dailyActivity frames

    ```{r}
    n_distinct(dailyActivity$Id)
    n_distinct(sleepy$Id)
    ```

    get the total steps and total distance for each member id

    ```{r}
    dailyActivity %>%
      group_by(Id) %>%
      summarise(freq = sum(TotalSteps)) %>%
      arrange(-freq)

    Tot_dist <- dailyActivity %>%
      mutate(Id = as.character(Id)) %>%
      group_by(Id) %>%
      summarise(dizzy = sum(TotalDistance)) %>%
      arrange(-dizzy)
    ```

    now get total minutes asleep & total time in bed

    ```{r}
    sleepy %>%
      group_by(Id) %>%
      summarise(Msleep = sum(TotalMinutesAsleep)) %>%
      arrange(Msleep)

    sleepy %>%
      group_by(Id) %>%
      summarise(inBed = sum(TotalTimeInBed)) %>%
      arrange(inBed)
    ```

    plot graphs for "in-bed and sleep data" & "total steps and distance" (a sketch of the sleep plot follows the chunk below)

    ```{r}
    ggplot(Tot_dist) +
      geom_count(mapping = aes(y = dizzy, x = Id, color = Id, fill = Id, size = 2)) +
      labs(x = "member id's", title = "distance miles") +
      theme(axis.text.x = element_text(angle = 90))
    ```
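
    The heading above also promises a plot of the sleep data; a minimal sketch of what it could look like (not part of the original notebook), built from the `sleepy` dataframe created earlier:

    ```{r}
    # Hypothetical companion plot, not part of the original notebook: total
    # minutes asleep vs. total time in bed per member id, from `sleepy`.
    sleep_totals <- sleepy %>%
      group_by(Id) %>%
      summarise(minutes_asleep = sum(TotalMinutesAsleep),
                minutes_in_bed = sum(TotalTimeInBed))

    ggplot(sleep_totals, aes(x = minutes_in_bed, y = minutes_asleep)) +
      geom_point(color = "steelblue") +
      geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
      labs(x = "total minutes in bed", y = "total minutes asleep",
           title = "time in bed vs. time asleep per member")
    ```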
    
  19. Yahoo Finance Dataset (2018-2023)

    • kaggle.com
    zip
    Updated May 9, 2023
    Cite
    Suruchi Arora (2023). Yahoo Finance Dataset (2018-2023) [Dataset]. https://www.kaggle.com/datasets/suruchiarora/yahoo-finance-dataset-2018-2023
    Explore at:
    zip(79394 bytes)Available download formats
    Dataset updated
    May 9, 2023
    Authors
    Suruchi Arora
    License

    Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    The "yahoo_finance_dataset(2018-2023)" dataset is a financial dataset containing daily stock market data for multiple assets such as equities, ETFs, and indexes. It spans from April 1, 2018 to March 31, 2023, and contains 1257 rows and 7 columns. The data was sourced from Yahoo Finance, and the purpose of the dataset is to provide researchers, analysts, and investors with a comprehensive dataset that they can use to analyze stock market trends, identify patterns, and develop investment strategies. The dataset can be used for various tasks, including stock price prediction, trend analysis, portfolio optimization, and risk management. The dataset is provided in XLSX format, which makes it easy to import into various data analysis tools, including Python, R, and Excel.

    The dataset includes the following columns:

    Date: The date on which the stock market data was recorded.
    Open: The opening price of the asset on the given date.
    High: The highest price of the asset on the given date.
    Low: The lowest price of the asset on the given date.
    Close*: The closing price of the asset on the given date. Note that this price does not take into account any after-hours trading that may have occurred after the market officially closed.
    Adj Close**: The adjusted closing price of the asset on the given date. This price takes into account any dividends, stock splits, or other corporate actions that may have occurred, which can affect the stock price.
    Volume: The total number of shares of the asset that were traded on the given date.
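
    As a quick illustration of importing the XLSX into R, a minimal sketch using the readxl package; the file name and exact column spellings (e.g. "Adj Close" without the footnote asterisks) are assumptions:

    ```
    # Minimal sketch: file name and column names are assumptions based on the
    # column descriptions above.
    library(readxl)
    library(dplyr)

    prices <- read_excel("yahoo_finance_dataset_2018_2023.xlsx") %>%
      mutate(Date = as.Date(Date)) %>%
      arrange(Date)

    # simple daily return from the adjusted close
    returns <- prices %>%
      mutate(daily_return = `Adj Close` / dplyr::lag(`Adj Close`) - 1)
    ```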

  20. o

    COPERNICUS Digital Elevation Model (DEM) for Europe at 30 meter resolution...

    • data.opendatascience.eu
    • data.mundialis.de
    • +1more
    Updated May 24, 2022
    + more versions
    Cite
    (2022). COPERNICUS Digital Elevation Model (DEM) for Europe at 30 meter resolution (EU-LAEA) derived from Copernicus Global 30 meter dataset [Dataset]. https://data.opendatascience.eu/geonetwork/srv/search?format=Cloud%20Optimized%20GeoTIFF
    Explore at:
    Dataset updated
    May 24, 2022
    Description

    The Copernicus DEM is a Digital Surface Model (DSM) that represents the surface of the Earth including buildings, infrastructure and vegetation. The original GLO-30 provides worldwide coverage at 30 meters (about 1 arc second). Ocean areas have no tiles; there, height values can be assumed to be zero. Data is provided as Cloud Optimized GeoTIFFs, and the vertical unit for elevation is meters. The Copernicus DEM for Europe at 30 meter resolution (EU-LAEA projection) in COG format has been derived from the Copernicus DEM GLO-30, mirrored on Open Data on AWS, a dataset managed by Sinergise (https://registry.opendata.aws/copernicus-dem/).

    Processing steps: The original Copernicus GLO-30 DEM contains a relevant percentage of tiles with non-square pixels. We created a mosaic map in VRT format (https://gdal.org/drivers/raster/vrt.html) and defined within the VRT file the rule to apply cubic resampling while reading the data, i.e. when importing it into GRASS GIS for further processing. We chose cubic instead of bilinear resampling since the height-width ratio of non-square pixels is up to 1:5; this minimizes artefacts between adjacent tiles in rugged terrain:

    gdalbuildvrt -input_file_list list_geotiffs_MOOD.csv -r cubic -tr 0.000277777777777778 0.000277777777777778 Copernicus_DSM_30m_MOOD.vrt

    To reproject the data to the EU-LAEA projection, bilinear resampling was performed in GRASS GIS (using r.proj), and the pixel values were scaled by 1000 (storing the pixels as integer values) to reduce data volume. In addition, a hillshade raster map was derived from the resampled elevation map (using r.relief in GRASS GIS). Finally, we exported the elevation and hillshade raster maps in Cloud Optimized GeoTIFF (COG) format, along with SLD and QML style files.

    Note that GLO-30 Public provides limited coverage at 30 meters because a small subset of tiles covering specific countries has not yet been released to the public by the Copernicus Programme.
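
    For illustration only, a minimal R/terra sketch of the reprojection and scaling steps described above (the published pipeline used GRASS GIS r.proj; file names are placeholders and EPSG:3035 is assumed to be the intended EU-LAEA CRS):

    ```
    # Minimal sketch of the reprojection step using R/terra instead of GRASS GIS;
    # not the pipeline actually used. File names are placeholders.
    library(terra)

    dem <- rast("Copernicus_DSM_30m_MOOD.vrt")          # mosaic built with gdalbuildvrt
    dem_laea <- project(dem, "EPSG:3035", method = "bilinear")

    # scale by 1000 and store as integer to reduce data volume, then write a COG
    dem_int <- round(dem_laea * 1000)
    writeRaster(dem_int, "copernicus_dem_30m_eu_laea.tif",
                filetype = "COG", datatype = "INT4S", overwrite = TRUE)
    ```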
