10 datasets found
  1. Supplement 1. R code for estimating thresholds while accounting for variable...

    • wiley.figshare.com
    html
    Updated Jun 2, 2023
    Cite
    Jay E. Jones; Andrew J. Kroll; Jack Giovanini; Steven D. Duke; Matthew G. Betts (2023). Supplement 1. R code for estimating thresholds while accounting for variable detection and data for estimating thresholds for forest birds, Oregon, USA, 2007–2008. [Dataset]. http://doi.org/10.6084/m9.figshare.3552231.v1
    Explore at:
    html. Available download formats.
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Wiley
    Authors
    Jay E. Jones; Andrew J. Kroll; Jack Giovanini; Steven D. Duke; Matthew G. Betts
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    United States, Oregon
    Description

    File List Supplement_Avian data.csv Supplement_R code.r Description The Supplement_Avian data.csv file contains data on stand-level habitat covariates and visit-specific detections of avian species, Oregon, USA, 2008–2009. Column definitions

        Stand id
        Percent cover of conifer species
        Percent cover of broadleaf species
        Percent cover of deciduous broadleaf species
        Percent cover of hardwood species
        Percent cover of hardwood species in a 2000 m radius circle around each sample stand
        Elevation (m) of stand
        Age of stand
        Year of sampling
        Visit number
        Detection of Magnolia Warbler on Visit 1
        Detection of Magnolia Warbler on Visit 2
        Detection of Orange-crowned Warbler on Visit 1
        Detection of Orange-crowned Warbler on Visit 2
        Detection of Swainson’s Thrush on Visit 1
        Detection of Swainson’s Thrush on Visit 2
        Detection of Willow Flycatcher on Visit 1
        Detection of Willow Flycatcher on Visit 2
        Detection of Wilson’s Warbler on Visit 1
        Detection of Wilson’s Warbler on Visit 2
    
      Checksum values are:
    
        Column 2 (Percent cover of conifer species – CONIFER): SUM = 5862.83
        Column 3 (Percent cover of broadleaf species – BROAD): SUM = 7043.17
        Column 4 (Percent cover of deciduous broadleaf species – DECBROAD): SUM = 5475.17
        Column 5 (Percent cover of hardwood species – HARDWOOD): SUM = 2151.96
        Column 6 (Percent cover of hardwood species in a 2000 m radius circle around each sample stand – HWD2000): SUM = 3486.07
        Column 7 (Stand elevation – ELEVM): SUM = 83240.58
        Column 8 (Stand age – AGE): SUM = 1537; NA indicates a stand was harvested in 2008
        Column 9 (Year of sampling – YEAR): SUM = 425792
        Column 11 (MGWA.1): SUM = 70
        Column 12 (MGWA.2): SUM = 71
        Column 13 (OCWA.1): SUM = 121
        Column 14 (OCWA.2): SUM = 76
        Column 15 (SWTH.1): SUM = 90
        Column 16 (SWTH.2): SUM = 95
        Column 17 (WIFL.1): SUM = 85
        Column 18 (WIFL.2): SUM = 85
        Column 19 (WIWA.1): SUM = 36
        Column 20 (WIWA.2): SUM = 37
    
      The Supplement_R code.r file is R source code for simulation and empirical analyses conducted in Jones et al.
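The checksum list above can be verified programmatically. A minimal sketch in Python/pandas (the supplement itself uses R; the helper function and the example expected-value dictionary here are illustrative, not part of the supplement):

```python
import pandas as pd

# Sketch: recompute the published column checksums for the avian data file.
# The column names and expected sums come from the checksum list above;
# verify_checksums itself is an illustrative helper, not the authors' code.
def verify_checksums(csv_path, expected):
    df = pd.read_csv(csv_path)  # "NA" values are parsed as missing
    results = {}
    for col, target in expected.items():
        total = round(df[col].sum(skipna=True), 2)
        results[col] = (total, total == target)
    return results

# e.g. verify_checksums("Supplement_Avian data.csv",
#                       {"CONIFER": 5862.83, "BROAD": 7043.17, "ELEVM": 83240.58})
```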
    
  2. Matrix profile analysis of Dansgaard-Oeschger events in palaeoclimate time...

    • rdm.inesctec.pt
    Updated Feb 6, 2024
    Cite
    (2024). Matrix profile analysis of Dansgaard-Oeschger events in palaeoclimate time series - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/cs-2024-002
    Explore at:
    Dataset updated
    Feb 6, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset includes all the datafiles and computational notebooks required to reproduce the work reported in the paper “Characterisation of Dansgaard-Oeschger events in palaeoclimate time series using the Matrix Profile”.

    Input datafiles

        time series (20-year resolution) of oxygen isotope ratios (δ18O) from the NGRIP ice core on the GICC05 time scale (source: https://www.iceandclimate.nbi.ku.dk, DOI: 10.1016/j.quascirev.2014.09.007): the 1st column is the time in ka (10³ years) b2k (before A.D. 2000), and the 2nd column the oxygen isotope concentration;
        time series (20-year resolution) of calcium concentration (Ca2+) from the NGRIP ice core on the GICC05 time scale (same source and DOI): the 1st column is the time in ka b2k, and the 2nd column the Ca2+ concentration;
        the same Ca2+ time series, artificially shifted by 10 ka (500 data points);
        the same Ca2+ time series, trimmed by 10 ka (500 data points).

    Code and computational notebooks

        R code for visualisation of matrix profile calculations;
        Jupyter notebook (Python) containing the matrix profile analysis of the oxygen isotope time series;
        Jupyter notebook (Python) containing the matrix profile analysis of the calcium time series;
        Jupyter notebook (Python) containing the join matrix profile analysis of the oxygen isotope and calcium time series;
        Jupyter notebook (R) for visualisation of matrix profile results of the oxygen isotope time series;
        Jupyter notebook (R) for visualisation of matrix profile results of the calcium time series;
        Jupyter notebook (R) for visualisation of join matrix profile results.

    Output datafiles

        matrix profile of the oxygen isotope time series (sub-sequence length of 2,500 years): the 1st column contains the matrix profile value (distance to the nearest sub-sequence), the 2nd column contains the profile index (the zero-based index location of the nearest sub-sequence).
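The two matrix profile quantities described above (distance to the nearest sub-sequence, and that sub-sequence's index) can be illustrated with a brute-force sketch. The notebooks presumably use an optimised library; this is only a conceptual stand-in:

```python
import numpy as np

# Conceptual brute-force self-join matrix profile (assumed method, not the
# notebooks' actual implementation). For each sub-sequence of length m,
# record the z-normalised Euclidean distance to its nearest non-trivial
# match, and that match's index.
def matrix_profile(ts, m):
    ts = np.asarray(ts, dtype=float)
    n = len(ts) - m + 1
    subs = np.array([ts[i:i + m] for i in range(n)])
    subs = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
    mp = np.full(n, np.inf)        # matrix profile values
    idx = np.zeros(n, dtype=int)   # profile index (nearest sub-sequence)
    excl = m // 2                  # exclusion zone: skip trivial self-matches
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= excl:
                continue
            d = np.linalg.norm(subs[i] - subs[j])
            if d < mp[i]:
                mp[i], idx[i] = d, j
    return mp, idx
```

Low values in `mp` mark motifs (repeated patterns such as recurring Dansgaard-Oeschger event shapes); high values mark discords.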

  3. Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...

    • search.datacite.org
    • openicpsr.org
    Updated 2018
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
    Explore at:
    Dataset updated
    2018
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    DataCite (https://www.datacite.org/)
    Authors
    Jacob Kaplan
    Description

    Version 5 release notes:
    Removes support for SPSS and Excel data. Changes the crimes that are stored in each file: there are more files now, with fewer crimes per file. The files and their included crimes have been updated below. Adds in agencies that report 0 months of the year. Adds a column that indicates the number of months reported, generated by summing the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; an agency may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime. Removes data on runaways.
    Version 4 release notes:
    Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes the names of the narcotic columns to indicate that they are only for synthetic narcotics.
    Version 3 release notes:
    Adds data for 2016. Orders rows by year (descending) and ORI.
    Version 2 release notes:
    Fixes bug where Philadelphia Police Department had an incorrect FIPS county code.
    The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2016 into a single file. These files are quite large and may take some time to load.
    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

    I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported"; in other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
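The actual cleaning was done in R; this pandas sketch merely restates the two rules from the paragraph above for illustration:

```python
import pandas as pd

# Restates the cleaning rules from the text (illustrative only; the real
# cleaning was done in R). "None/not reported" becomes 0, and the
# known-bad arrest counts become NA.
BAD_VALUES = [10000, 20000, 30000, 40000, 50000, 60000,
              70000, 80000, 90000, 100000, 99999, 99998]

def clean_arrest_column(col):
    col = col.replace("None/not reported", 0)  # assume zero arrests reported
    col = pd.to_numeric(col, errors="coerce")  # anything unparseable -> NA
    return col.mask(col.isin(BAD_VALUES))      # impossible counts -> NA
```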

    To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.

    I created 9 arrest categories myself. The categories are:
    Total Male Juvenile, Total Female Juvenile, Total Male Adult, Total Female Adult, Total Male, Total Female, Total Juvenile, Total Adult, and Total Arrests.
    All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file includes only the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.
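The nine derived categories are plain sums of the sex-age columns. A hedged pandas sketch (the column names passed in are hypothetical placeholders; the real files have many per-age columns, which would simply be longer lists here):

```python
import pandas as pd

# Hedged sketch of deriving the nine totals from sex-age columns.
# The caller supplies which columns are male/female juvenile/adult.
def add_totals(df, male_juv, male_adult, female_juv, female_adult):
    df = df.copy()
    df["tot_male_juv"] = df[male_juv].sum(axis=1)
    df["tot_male_adult"] = df[male_adult].sum(axis=1)
    df["tot_female_juv"] = df[female_juv].sum(axis=1)
    df["tot_female_adult"] = df[female_adult].sum(axis=1)
    df["tot_male"] = df["tot_male_juv"] + df["tot_male_adult"]
    df["tot_female"] = df["tot_female_juv"] + df["tot_female_adult"]
    df["tot_juv"] = df["tot_male_juv"] + df["tot_female_juv"]
    df["tot_adult"] = df["tot_male_adult"] + df["tot_female_adult"]
    df["tot_arrests"] = df["tot_male"] + df["tot_female"]
    return df
```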

    As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight that contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each cover a major crime category and do not overlap in crimes other than the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Because Stata limits column names to 32 characters, I have abbreviated the crime names in the data. The files and their included crimes are:

    Index Crimes: Murder, Rape, Robbery, Aggravated Assault, Burglary, Theft, Motor Vehicle Theft, Arson
    Alcohol Crimes: DUI, Drunkenness, Liquor
    Drug Crimes: Total Drug, Total Drug Sales, Total Drug Possession, Cannabis Possession, Cannabis Sales, Heroin or Cocaine Possession, Heroin or Cocaine Sales, Other Drug Possession, Other Drug Sales, Synthetic Narcotic Possession, Synthetic Narcotic Sales
    Grey Collar and Property Crimes: Forgery, Fraud, Stolen Property
    Financial Crimes: Embezzlement, Total Gambling, Other Gambling, Bookmaking, Numbers Lottery
    Sex or Family Crimes: Offenses Against the Family and Children, Other Sex Offenses, Prostitution, Rape
    Violent Crimes: Aggravated Assault, Murder, Negligent Manslaughter, Robbery, Weapon Offenses
    Other Crimes: Curfew, Disorderly Conduct, Other Non-traffic, Suspicion, Vandalism, Vagrancy
    Simple: This data set has every crime and only the arrest categories that I created (see above).

  4. Data and scripts for the analysis of the influence of crop pollinator...

    • data.niaid.nih.gov
    Updated Aug 8, 2023
    Cite
    Kitzberger, Thomas (2023). Data and scripts for the analysis of the influence of crop pollinator dependence and growth form on yield decline [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7863824
    Explore at:
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    Kitzberger, Thomas
    Gleiser, Gabriela
    Milla, Rubén
    Aizen, Marcelo Adrián
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Marcelo A. Aizen, Gabriela R. Gleiser, Thomas Kitzberger, Ruben Milla. Being a tree crop increases the odds of experiencing yield declines irrespective of pollinator dependence (to be submitted to PCI)

    Data and R scripts to reproduce the analyses and the figures shown in the paper. All analyses were performed using R 4.0.2.

    Data

    1. FAOdata_21-12-2021.csv

    This file includes yearly data (1961-2020, column 8) on yield and cultivated area (columns 6 and 10) at the country, sub-regional, and regional levels (column 2) for each crop (column 4), drawn from the United Nations Food and Agriculture Organization database (data available at http://www.fao.org/faostat/en; accessed 21-12-2021). [Used in Script 1 to generate the synthesis dataset]

    2. countries.csv

    This file provides information on the region (column 2) to which each country (column 1) belongs. [Used in Script 1 to generate the synthesis dataset]

    3. dependence.csv

    This file provides information on the pollinator dependence category (column 2) of each crop (column 1).

    4. traits.csv

    This file provides information on the traits of each crop other than pollinator dependence, including, besides the crop name (column 1), the variables type of harvested organ (column 5) and growth form (column 6). [Used in Script 1 to generate the synthesis dataset]

    5. dataset.csv

    The synthesis dataset generated by Script 1.

    6. growth.csv

    The yield growth dataset generated by Script 1 and used as input by Scripts 2 and 3.

    7. phylonames.csv

    This file lists all the crops (column 1) and their equivalent tip names in the crop phylogeny (column 2). [Used in Script 2 for the phylogenetically-controlled analyses]

    8. phylo137.tre

    File containing the phylogenetic tree.

    Scripts

    1. dataset

    This R script curates and merges all the individual datasets mentioned above into a single dataset, estimating and adding to this single dataset the growth rate for each crop and country, and the (log) cumulative harvested area per crop and country over the period 1961-2020.
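A hedged sketch of the two per-crop-and-country quantities the script adds (the scripts are in R and their exact estimator may differ; here the growth rate is taken as the OLS slope of log yield on year, a common choice, and the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative reconstruction, not the authors' Script 1: per crop and
# country, estimate a yield growth rate (OLS slope of log(yield) on year)
# and the log cumulative harvested area over the study period.
def crop_country_summary(df):
    rows = []
    for (crop, country), g in df.groupby(["crop", "country"]):
        slope = np.polyfit(g["year"], np.log(g["yield"]), 1)[0]
        rows.append({"crop": crop, "country": country,
                     "growth_rate": slope,
                     "log_cum_area": np.log(g["area"].sum())})
    return pd.DataFrame(rows)
```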

    2. analyses

    This R script includes all the analyses described in the article’s main text.

    3. figures

    This R script creates all the main and supplementary figures of this article.

    4. lme4_phylo_setup

    R function written by Li and Bolker (2019) to carry out phylogenetically-controlled generalized linear mixed-effects models as described in the main text of the article.

    References

    Li, M., and B. Bolker. 2019. wzmli/phyloglmm: First release of phylogenetic comparative analysis in lme4- verse. Zenodo. https://doi.org/10.5281/zenodo.2639887.

  5. Inequality measures based on election data 1871 and 1892 for Swedish...

    • demo.researchdata.se
    • researchdata.se
    Updated Apr 30, 2019
    Cite
    Sara Moricz (2019). Inequality measures based on election data 1871 and 1892 for Swedish municipalities [Dataset]. http://doi.org/10.5878/cw7b-g897
    Explore at:
    Dataset updated
    Apr 30, 2019
    Dataset provided by
    Lund University
    Authors
    Sara Moricz
    Time period covered
    1871
    Area covered
    Sverige
    Description

    The data contains inequality measures at the municipality level for 1892 and 1871, as estimated in the PhD thesis "Institutions, Inequality and Societal Transformations" by Sara Moricz. The data also contains the source publications: 1) table 1 from “Bidrag till Sverige official statistik R) Valstatistik. XI. Statistiska Centralbyråns underdåniga berättelse rörande kommunala rösträtten år 1892” (biSOS R 1892); 2) table 1 from “Bidrag till Sverige official statistik R) Valstatistik. II. Statistiska Centralbyråns underdåniga berättelse rörande kommunala rösträtten år 1871” (biSOS R 1871).

    moricz_inequality_agriculture.csv

    A UTF-8 encoded .csv-file. Each row is a municipality of the agricultural sample (2222 in total). Each column is a variable.

    R71muncipality_id: a unique identifier for the municipalities in the R1871 publication (the municipality name can be obtained from the source data)
    R92muncipality_id: a unique identifier for the municipalities in the R1892 publication (the municipality name can be obtained from the source data)
    agriTop1_1871: an ordinal measure (ranking) of the top 1 income share in the agricultural sector for 1871
    agriTop1_1892: an ordinal measure (ranking) of the top 1 income share in the agricultural sector for 1892
    highestFarm_1871: a cardinal measure of the top 1 person share in the agricultural sector for 1871
    highestFarm_1892: a cardinal measure of the top 1 person share in the agricultural sector for 1892

    moricz_inequality_industry.csv

    A UTF-8 encoded .csv-file. Each row is a municipality of the industrial sample (1328 in total). Each column is a variable.

    R71muncipality_id: see above description
    R92muncipality_id: see above description
    indTop1_1871: an ordinal measure (ranking) of the top 1 income share in the industrial sector for 1871
    indTop1_1892: an ordinal measure (ranking) of the top 1 income share in the industrial sector for 1892

    moricz_R1892_source_data.csv

    A UTF-8 encoded .csv-file with the source data. The variables are described in the adherent codebook moricz_R1892_source_data_codebook.csv.

    Contains table 1 from “Bidrag till Sverige official statistik R) Valstatistik. XI. Statistiska Centralbyråns underdåniga berättelse rörande kommunala rösträtten år 1892” (biSOS R 1892). SCB provides the scanned publication on their website. Dollar Typing Service typed and delivered the data in 2015. All numerical variables but two have been checked; this is easy to do since nearly all columns should sum up to another column. For “Folkmangd” (population) the numbers have been corrected against U1892. The highest estimate of errors in the checked variables is 0.005 percent (0.5 promille), calculated at cell level. The two numerical variables which have not been checked are “hogsta_fyrk_jo“ and “hogsta_fyrk_ov“, as these cannot readily be compared internally within the data. According to my calculations, in the worst-case scenario these variables carry measurement errors of 0.0043 percent (0.43 promille).

    moricz_R1871_source_data.csv

    A UTF-8 encoded .csv-file with the source data. The variables are described in the adherent codebook moricz_R1871_source_data_codebook.csv.

    Contains table 1 from “Bidrag till Sverige official statistik R) Valstatistik. II. Statistiska Centralbyråns underdåniga berättelse rörande kommunala rösträtten år 1871” (biSOS R 1871). SCB provides the scanned publication on their website. Dollar Typing Service typed and delivered the data in 2015. The variables have been checked for accuracy, which is feasible since columns and rows should sum. The variables that most likely carry mistakes are “hogsta_fyrk_al” and “hogsta_fyrk_jo”.

  6. First quarter 2024 / Table BOAMP-SIREN-BUYERS (BSA): a cross between the...

    • gimi9.com
    Updated Feb 13, 2025
    Cite
    (2025). First quarter 2024 / Table BOAMP-SIREN-BUYERS (BSA): a cross between the BOAMP table (DILA) and the Sirene Business Base (INSEE) | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_6644ac7663969d80f6047dd8/
    Explore at:
    Dataset updated
    Feb 13, 2025
    Description

    Crossing table of the BOAMP table (DILA) with the Sirene Business Base (INSEE) / First Quarter 2024.
    - The BUYER's Siren number (column "SN_30_Siren") is filled in for each ad (column and primary key "B_17_idweb");
    - Several columns facilitating data mining have been added;
    - The names of the original columns have been prefixed, numbered and sorted alphabetically.
    You will find here:
    - The BSA for the first quarter of 2024 in free and open access (CSV with semicolon separator, and Parquet);
    - The schema of the BSA table (CSV, comma separator);
    - An excerpt from the March 30 BSA (CSV, comma separator) to quickly give you an idea of the data in the Datagouv explorer.
    NB: the March 30 excerpt has its JSON cell columns GESTION, DONNEES, and ANNONCES_ANTERIEURES purged. The data thus deleted can be found in a nicer format by following the links in the added columns:
    - B_41_GESTION_URL_JSON;
    - B_43_DONNEES_URL_JSON;
    - B_45_ANNONCES_ANTERIEURES_URL_JSON.
    More info: daily and paid updates on the entire BOAMP 2024 are available on our website under AuFilDuBoamp Downloads; further documentation can be found at AuFilDuBoamp Doc & TP.
    Data sources: the SIRENE database of companies and their establishments (SIREN, SIRET) of August; the BOAMP API.
    To download the first quarter of the BSA with Python, run:
    For the CSV: df = pd.read_csv("https://www.data.gouv.fr/en/datasets/r/63f0d792-148a-4c95-a0b6-9e8ea8b0b34a", dtype='string', sep=';')
    For the Parquet: df = pd.read_parquet("https://www.data.gouv.fr/en/datasets/r/f7a4a76e-ff50-4dc6-bae8-97368081add2")
    Enjoy!
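The download commands above wrapped into a small helper (the wrapper name is illustrative): reading every column as strings preserves leading zeros in Siren/Siret identifiers, which numeric parsing would silently drop.

```python
import pandas as pd

# Read a BSA extract: semicolon-separated CSV, all columns kept as strings
# so identifiers such as Siren/Siret retain leading zeros.
# Accepts a URL or a local file path.
def read_bsa_csv(path_or_url):
    return pd.read_csv(path_or_url, dtype="string", sep=";")
```

The Parquet variant (`pd.read_parquet(...)`) needs no separator handling and preserves column types as stored.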

  7. Examples of CARE-related Activities Carried out by Repositories, in...

    • portal.edirepository.org
    csv, pdf
    Updated Mar 13, 2024
    + more versions
    Cite
    Ruth Duerr (2024). Examples of CARE-related Activities Carried out by Repositories, in Sequences or Groups [Dataset]. http://doi.org/10.6073/pasta/1b812b3bd296d23c4c7c54eb022774fc
    Explore at:
    pdf (63891 bytes), csv (7273 bytes). Available download formats.
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    EDI
    Authors
    Ruth Duerr
    Time period covered
    2020 - 2023
    Variables measured
    Trigger, Outreach, Technical, Repository Protocols, Situational Awareness
    Description

    This dataset is designed to accompany the paper submitted to Data Science Journal: O'Brien et al, "Earth Science Data Repositories: Implementing the CARE Principles". This dataset shows examples of activities that data repositories are likely to undertake as they implement the CARE principles. These examples were constructed as part of a discussion about the challenges faced by data repositories when acquiring, curating, and disseminating data and other information about Indigenous Peoples, communities, and lands. For clarity, individual repository activities were very specific. However, in practice, repository activities are not carried out singly, but are more likely to be performed in groups or in sequence. This dataset shows examples of how activities are likely to be combined in response to certain triggers. See related dataset O'Brien, M., R. Duerr, R. Taitingfong, A. Martinez, L. Vera, L. Jennings, R. Downs, E. Antognoli, T. ten Brink, N. Halmai, S.R. Carroll, D. David-Chavez, M. Hudson, and P. Buttigieg. 2024. Alignment between CARE Principles and Data Repository Activities. Environmental Data Initiative. https://doi.org/10.6073/pasta/23e699ad00f74a178031904129e78e93 (Accessed 2024-03-13), and the paper for more information about development of the activities and their categorization, raw data of relationships between specific activities and a discussion of the implementation of CARE Principles by data repositories.

       Data in this table are organized into groups delineated by a triggering event in the
      first column. For example, the first group consists of 9 rows; while the second group has 7
      rows. The first row of each group contains the event that triggers the set of actions
      described in the last 4 columns of the spreadsheet. Within each group, the associated rows
      in each column are given in numerical not temporal order, since activities will likely vary
      widely from repository to repository.
    
       For example, the first group of rows is about what likely needs to happen if a
      repository discovers that it holds Indigenous data (O6). Clearly, it will need to develop
      processes to identify communities to engage (R6) as well as processes for contacting those
      communities (R7) (if it doesn't already have them). It will also probably need to review and
      possibly update its data management policies to ensure that they are justifiable (R2). Based
      on these actions, it is likely that the repository's outreach group needs to prepare for
      working with more communities (O3) including ensuring that the repository's governance
      protocols are up-to-date and publicized (O5) and that the repository practices are
      transparent (O4). If initial contacts go well, it is likely that the repository will need
      ongoing engagement with the community or communities (S1). This may include adding
      representation to the repository's advisory board (O2); clarifying data usage with the
      communities (O9), facilitating relationships between data providers and communities (O1);
      working with the community to identify educational opportunities (O10); and sharing data
      with them (O8). It may also become necessary to liaise with whomever is maintaining the
      vocabularies in use at the repository (O7).
    
  8. Classification and Quantification of Strawberry Fruit Shape

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 24, 2020
    + more versions
    Cite
    Feldmann, Mitchell J. (2020). Classification and Quantification of Strawberry Fruit Shape [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3365714
    Explore at:
    Dataset updated
    Apr 24, 2020
    Dataset authored and provided by
    Feldmann, Mitchell J.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "Classification and Quantification of Strawberry Fruit Shape" is a dataset that includes raw RGB images and binary images of strawberry fruit. These folders contain JPEG images taken from the same experimental units on 2 different harvest dates. Images in each folder are labeled according to the 4 digit plot ID from the field experiment (####_) and the 10 digit individual ID (_##########).

    "H1" and "H2" folders contain RGB images of multiple fruits. Each fruit was extracted and binarized to become the images in "H1_indiv" and "H2_indiv".

    "H1_indiv" and "H2_indiv" folders contain images of individual fruit. Each fruit is bordered by ten white pixels. There are a total of 6,874 images between these two folders. The images were used then resized and scaled to be the images in "ReSized".

    "ReSized" contains 6,874 binary images of individual berries. These images are all square images (1000x1000px) with the object represented by black pixels (0) and background represented with white pixels (1). Each image was scaled so that it would take up the maximum number of pixels in a 1000 x 1000px image and would maintain the aspect ratio.

    "Fruit_image_data.csv" contains all of the morphometric features extracted from individual images including intermediate values.

    All images title with the form "B##_NA" were discarded prior to any analyses. These images come from the buffer plots, not the experimental units of the study.

    "PPKC_Figures.zip" contains all figures (F1-F7) and supplemental figures (S1-S7_ from the manuscript. Captions for the main figures are found in the manuscript. Captions for Supplemental figures are below.

    Fig. S1 Results of PPKC against original cluster assignments. Ordered centroids from k = 2 to k = 8. On the left are the unordered assignments from k-means, and the on the right are the order assignments following PPKC. Cluster position indicated on the right [1, 8].

    Fig. S2 Optimal Value of k. (A) Total within-cluster sum of squares. (B) The inverse of the adjusted R². (C) Akaike information criterion (AIC). (D) Bayesian information criterion (BIC). All metrics were calculated on a random sample of 3,437 images (50%). 10 samples were randomly drawn. The vertical dashed line in each plot represents the optimal value of k. Reported metrics are standardized to be between [0, 1].

    Fig. S3 Hierarchical clustering and distance between classes on PC1. The relationship between clusters at each value of k is represented as both a dendrogram and as bar plot. The labels on the dendrogram (i.e., V1, V2, V3,..., V10) represent the original cluster assignment from k-means. The barplot to the right of each dendrogram depicts the elements of the eigenvector associated with the largest eigenvalue form PPKC. The labels above each line represent the original cluster assignment.

    Fig. S4 BLUPs for 13 selected features. For each plot, the X-axis is the index and the Y-axis is the BLUP value estimated from a linear mixed model. Grey points represent the mean feature value for each individual. Each point is the BLUP for a single genotype.

    Fig. S5 Effects of Eigenfruit, Vertical Biomass, and Horizontal Biomass Analyses. (A) Effects of PC [1, 7] from the Eigenfruit analysis on the mean shape (center column). The left column is the mean shape minus 1.5× the standard deviation. Right is the mean shape plus 1.5× the standard deviation. The horizontal axis is the horizontal pixel position. The vertical axis is the vertical pixel position. (B) Effects of PC [1, 3] from the Horizontal Biomass analysis on the mean shape (center column). The left column is the mean shape minus 1.5× the standard deviation. Right is the mean shape plus 1.5× the standard deviation. The horizontal axis is the vertical position from the image (height). The vertical axis is the number of activated pixels (RowSum) at the given vertical position. (C) Effects of PC [1, 3] from the Vertical Biomass analysis on the mean shape (center column). The left column is the mean shape minus 1.5× the standard deviation. Right is the mean shape plus 1.5× the standard deviation. The horizontal axis is the horizontal position from the image (width). The vertical axis is the number of activated pixels (ColSum) at the given horizontal position.

    Fig. S6 PPKC with variable sample size. Ordered centroids from k = 2 to k = 5 using different image sets for clustering. For all k = [2, 5], k-means clustering was performed using 100%, 80%, 50%, or 20% of the total number of images (6,874; 5,500; 3,437; and 1,374 images, respectively). Cluster position is indicated on the right [1, 5].

    Fig. S7 Comparison of scale and continuous features. (A) PPKC 4-unit ordinal scale. (B) Distributions of the selected features within each level of the PPKC 4-unit ordinal scale (k = 4). The light gray line is cluster 1, the medium gray line is cluster 2, the dark gray line is cluster 3, and the black line is cluster 4.

  9. Data from: Bike Sharing Dataset

    • kaggle.com
    Updated Sep 10, 2024
    Ram Vishnu R (2024). Bike Sharing Dataset [Dataset]. https://www.kaggle.com/datasets/ramvishnur/bike-sharing-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ram Vishnu R
    Description

    Problem Statement:

    A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems allow people to borrow a bike from a computer-controlled "dock": the user enters payment information, and the system unlocks a bike, which can then be returned to another dock belonging to the same system.

    A US bike-sharing provider, BoomBikes, has recently suffered a considerable dip in revenue due to the COVID-19 pandemic. The company is finding it difficult to sustain itself in the current market, so it has decided to develop a mindful business plan to accelerate its revenue.

    To that end, BoomBikes aspires to understand the demand for shared bikes among the public, so that once the situation improves it can cater to people's needs, stand out from other service providers, and grow its profits.

    They have contracted a consulting company to identify the factors on which demand for these shared bikes depends, specifically in the American market. The company wants to know:

    • Which variables are significant in predicting the demand for shared bikes.
    • How well those variables describe the bike demand.

    Based on various meteorological surveys and people's lifestyles, the service provider firm has gathered a large dataset of daily bike demand across the American market.

    Business Goal:

    You are required to model the demand for shared bikes using the available independent variables. Management will use the model to understand how exactly demand varies with different features, so that they can adjust business strategy to meet demand levels and customer expectations. Further, the model will help management understand the demand dynamics of a new market.

    Data Preparation:

    1. You can observe in the dataset that some variables, like 'weathersit' and 'season', have values 1, 2, 3, and 4 with specific labels associated with them (as can be seen in the data dictionary). These numeric codes may suggest an ordering, which is actually not the case (check the data dictionary and consider why). It is therefore advisable to convert such feature values into categorical string values before proceeding with model building. Please refer to the data dictionary for a better understanding of all the independent variables.
    2. You might notice that the column 'yr' has two values, 0 and 1, indicating the years 2018 and 2019, respectively. Your first instinct might be to drop this column, since a two-valued column might seem to add little to the model. In reality, because bike-sharing systems are steadily gaining popularity, demand is increasing every year, so 'yr' might be a good variable for prediction. Think twice before dropping it.
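
    A minimal pandas sketch of the conversion described in step 1, using a few illustrative rows; the label mappings below are assumptions standing in for the actual data dictionary.

```python
import pandas as pd

# Illustrative rows with the coded columns mentioned above.
df = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "weathersit": [1, 2, 3, 1],
    "yr": [0, 1, 0, 1],
})

# Map numeric codes to string labels so no spurious order is implied
# (the exact labels should come from the data dictionary; these are assumed).
season_labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
weather_labels = {1: "clear", 2: "mist", 3: "light_rain"}
df["season"] = df["season"].map(season_labels)
df["weathersit"] = df["weathersit"].map(weather_labels)

# One-hot encode the categoricals; keep 'yr' as-is, since it is already
# binary and, per step 2, likely a useful predictor.
df = pd.get_dummies(df, columns=["season", "weathersit"], drop_first=True)
print(df.columns.tolist())
```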

    Model Building:

    In the dataset provided, you will notice three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who made a rental, while 'registered' shows the number of registered users who made a booking on a given day. Finally, 'cnt' indicates the total number of bike rentals, including both casual and registered users. The model should be built with 'cnt' as the target variable.
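
    The end-to-end shape of the task can be sketched as follows. The data here are synthetic stand-ins for the BoomBikes dataset (column names and relationships are invented for illustration); the key point is the leakage guard: 'casual' and 'registered' sum to 'cnt', so they must be dropped from the predictors.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the daily demand data (illustrative only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "temp": rng.uniform(0, 1, n),
    "hum": rng.uniform(0, 1, n),
    "windspeed": rng.uniform(0, 1, n),
    "yr": rng.integers(0, 2, n),
})
df["casual"] = (1000 * df["temp"] + 500 * df["yr"] + rng.normal(0, 50, n)).round()
df["registered"] = (2000 * df["temp"] + 1500 * df["yr"] + rng.normal(0, 50, n)).round()
df["cnt"] = df["casual"] + df["registered"]  # cnt = casual + registered

# Drop 'casual' and 'registered': they sum exactly to the target and would leak it.
X = df.drop(columns=["cnt", "casual", "registered"])
y = df["cnt"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(round(r2_score(y_test, y_pred), 3))
```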

    Model Evaluation:

    When you are done with model building and residual analysis and have made predictions on the test set, use the following two lines of code to calculate the R-squared score on the test set:

        from sklearn.metrics import r2_score
        r2_score(y_test, y_pred)

    Here, y_test is the test data set for the target variable and y_pred contains the predicted values of the target variable on the test set. Please perform this step, as the R-squared score on the test set serves as the benchmark for your model.

  10. Food and Agriculture Biomass Input–Output (FABIO) database

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 8, 2022
    Bruckner, Martin (2022). Food and Agriculture Biomass Input–Output (FABIO) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2577066
    Explore at:
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    Kuschnig, Nikolas
    Bruckner, Martin
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This data repository provides the Food and Agriculture Biomass Input–Output (FABIO) database, a global set of multi-regional physical supply-use and input-output tables covering agriculture and forestry.

    The work is based mostly on freely available data from FAOSTAT, IEA, EIA, and UN Comtrade/BACI. FABIO currently covers 191 countries + RoW, 118 processes, and 125 commodities (raw and processed agricultural and food products) for 1986–2013. All R code and auxiliary data are available on GitHub. For more information, please refer to https://fabio.fineprint.global.

    The database consists of the following main components, in compressed .rds format:

    Z: the inter-commodity input-output matrix, displaying the relationships of intermediate use of each commodity in the production of each commodity, in physical units (tons). The matrix has 24000 rows and columns (125 commodities x 192 regions), and is available in two versions, based on the method to allocate inputs to outputs in production processes: Z_mass (mass allocation) and Z_value (value allocation). Note that the row sums of the Z matrix (= total intermediate use by commodity) are identical in both versions.

    Y: the final demand matrix, denoting the consumption of all 24000 commodities by destination country and final use category. There are six final use categories (yielding 192 x 6 = 1152 columns): 1) food use, 2) other use (non-food), 3) losses, 4) stock addition, 5) balancing, and 6) unspecified.

    X: the total output vector of all 24000 commodities. Total output is equal to the sum of intermediate and final use by commodity.

    L: the Leontief inverse, computed as (I − A)^(-1), where A is the matrix of input coefficients derived from Z and X. Again, there are two versions, depending on the underlying version of Z (L_mass and L_value).
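
    The construction of L from Z and the output vector can be illustrated with toy numbers (three commodities, one region; all values invented, not FABIO data):

```python
import numpy as np

# Toy intermediate-use matrix (tons) and total output vector.
Z = np.array([[2.0, 4.0, 1.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])
x = np.array([10.0, 20.0, 15.0])

A = Z / x                         # input coefficients: a_ij = z_ij / x_j
L = np.linalg.inv(np.eye(3) - A)  # Leontief inverse (I - A)^-1

# Accounting identity: final demand y = x - (row sums of intermediate use),
# and the Leontief inverse maps it back to total output: L @ y == x.
y = x - Z.sum(axis=1)
print(np.round(L @ y, 6))
```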

    E: environmental extensions for each of the 24000 commodities, including four resource categories: 1) primary biomass extraction (in tons), 2) land use (in hectares), 3) blue water use (in m³), and 4) green water use (in m³).

    mr_sup_mass/mr_sup_value: For each allocation method (mass/value), the supply table gives the physical supply quantity of each commodity by producing process, with processes in the rows (118 processes x 192 regions = 22656 rows) and commodities in columns (24000 columns).

    mr_use: the use table captures the quantities of each commodity (rows) used as an input in each process (columns).

    A description of the included countries and commodities (i.e. the rows and columns of the Z matrix) can be found in the auxiliary file io_codes.csv. Separate lists of the country sample (including ISO3 codes and continental grouping) and commodities (including moisture content) are given in the files regions.csv and items.csv, respectively. For information on the individual processes, see auxiliary file su_codes.csv. RDS files can be opened in R. Information on how to read these files can be obtained here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS

    Except for X.rds, which contains a matrix, all variables are organized as lists, where each element contains a sparse matrix. Please note that values are always given in physical units, i.e. tonnes or head, as specified in items.csv. The suffixes value and mass only indicate the form of allocation chosen for the construction of the symmetric IO tables (for more details see Bruckner et al. 2019). Product, process, and country classifications can be found in the file fabio_classifications.xlsx.

    Footprint results are not contained in the database but can be calculated, e.g. by using this script: https://github.com/martinbruckner/fabio_comparison/blob/master/R/fabio_footprints.R
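
    As a rough illustration of such a footprint calculation (not the linked script), demand-based footprints come down to extension intensities times the Leontief inverse times final demand. The matrices and the single "land use" extension row below are toy values:

```python
import numpy as np

# Toy system: intermediate use Z, total output x, one extension row e.
Z = np.array([[2.0, 4.0, 1.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])
x = np.array([10.0, 20.0, 15.0])
e = np.array([5.0, 8.0, 2.0])  # e.g. land use (ha) by producing commodity

A = Z / x
L = np.linalg.inv(np.eye(3) - A)
y = x - Z.sum(axis=1)          # final demand consistent with Z and x

intensity = e / x              # extension per unit of output
footprint = intensity @ L @ np.diag(y)  # extension embodied in each final-demand element
print(np.round(footprint, 6))
```

    Summing the footprint over all final-demand elements recovers the total extension, since L @ y equals x.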

    How to cite:

    To cite FABIO work please refer to this paper:

    Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., Börner, J. 2019. FABIO – The Construction of the Food and Agriculture Input–Output Model. Environmental Science & Technology 53(19), 11302–11312. DOI: 10.1021/acs.est.9b03554

    License:

    This data repository is distributed under the CC BY-NC-SA 4.0 License. You are free to share and adapt the material for non-commercial purposes using proper citation. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. In case you are interested in a collaboration, I am happy to receive enquiries at martin.bruckner@wu.ac.at.

    Known issues:

    The underlying FAO data have been manipulated only to the minimum extent necessary; data filling and supply-use balancing nevertheless required some adaptations. These are documented in the code and are also reflected in the balancing item of the final demand matrices. For proper use of the database, I recommend distributing the balancing item proportionally over all other uses, and running analyses both with and without balancing to illustrate the uncertainty involved.
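
    The recommended redistribution of the balancing item can be sketched as follows; the final-demand rows and the column order (balancing in the fifth column) are invented for illustration:

```python
import numpy as np

# Toy final-demand rows: [food, other, losses, stock additions, balancing, unspecified]
Y = np.array([
    [40.0, 10.0, 5.0, 5.0, 10.0, 0.0],
    [20.0,  0.0, 2.0, 8.0, -5.0, 5.0],
])

bal = Y[:, 4]                                    # the balancing column
rest = np.delete(Y, 4, axis=1)                   # all other use categories
shares = rest / rest.sum(axis=1, keepdims=True)  # proportional shares per row
Y_adj = rest + shares * bal[:, None]             # spread balancing proportionally

print(np.round(Y_adj.sum(axis=1), 6))  # row totals are preserved
```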


Jay E. Jones; Andrew J. Kroll; Jack Giovanini; Steven D. Duke; Matthew G. Betts (2023). Supplement 1. R code for estimating thresholds while accounting for variable detection and data for estimating thresholds for forest birds, Oregon, USA, 2007–2008. [Dataset]. http://doi.org/10.6084/m9.figshare.3552231.v1

Supplement 1. R code for estimating thresholds while accounting for variable detection and data for estimating thresholds for forest birds, Oregon, USA, 2007–2008.

Description

File List: Supplement_Avian data.csv, Supplement_R code.r

The Supplement_Avian data.csv file contains data on stand-level habitat covariates and visit-specific detections of avian species, Oregon, USA, 2008–2009. Column definitions:

    Stand id
    Percent cover of conifer species
    Percent cover of broadleaf species
    Percent cover of deciduous broadleaf species
    Percent cover of hardwood species
    Percent cover of hardwood species in a 2000 m radius circle around each sample stand
    Elevation (m) of stand
    Age of stand
    Year of sampling
    Visit number
    Detection of Magnolia Warbler on Visit 1
    Detection of Magnolia Warbler on Visit 2
    Detection of Orange-crowned Warbler on Visit 1
    Detection of Orange-crowned Warbler on Visit 2
    Detection of Swainson’s Thrush on Visit 1
    Detection of Swainson’s Thrush on Visit 2
    Detection of Willow Flycatcher on Visit 1
    Detection of Willow Flycatcher on Visit 2
    Detection of Wilson’s Warbler on Visit 1
    Detection of Wilson’s Warbler on Visit 2

  Checksum values are:

    Column 2 (Percent cover of conifer species – CONIFER): SUM = 5862.83
    Column 3 (Percent cover of broadleaf species – BROAD): SUM = 7043.17
    Column 4 (Percent cover of deciduous broadleaf species – DECBROAD): SUM = 5475.17
    Column 5 (Percent cover of hardwood species – HARDWOOD): SUM = 2151.96
    Column 6 (Percent cover of hardwood species in a 2000 m radius circle around each sample stand – HWD2000): SUM = 3486.07
    Column 7 (Stand elevation – ELEVM): SUM = 83240.58
    Column 8 (Stand age – AGE): SUM = 1537; NA indicates a stand was harvested in 2008
    Column 9 (Year of sampling – YEAR): SUM = 425792
    Column 11 (MGWA.1): SUM = 70
    Column 12 (MGWA.2): SUM = 71
    Column 13 (OCWA.1): SUM = 121
    Column 14 (OCWA.2): SUM = 76
    Column 15 (SWTH.1): SUM = 90
    Column 16 (SWTH.2): SUM = 95
    Column 17 (WIFL.1): SUM = 85
    Column 18 (WIFL.2): SUM = 85
    Column 19 (WIWA.1): SUM = 36
    Column 20 (WIWA.2): SUM = 37

  The Supplement_R code.r file is R source code for simulation and empirical analyses conducted in Jones et al.