Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file, as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates; in the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains the continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Adjust the window to your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.
https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
SOURCES
The dataset has four main portions of Bangla text. One contains only Bangla text (12,179 unsuspicious, 7,822 suspicious). Another contains mixed Bangla and English (12,725 unsuspicious, 7,219 suspicious). Another contains politically suspicious content (167 unsuspicious, 132 suspicious). The last contains comments with @name mentions (53,855 unsuspicious, 6,145 suspicious). Finally, a CSV file contains all the categorical Bangla data, totalling more than 100,100 records.
COLLECTION METHODOLOGY
- Suspicious tweets: https://www.kaggle.com/datasets/syedabbasraza/suspicious-tweets
- Suspicious Tweets: https://www.kaggle.com/datasets/munkialbright/suspicious-tweets
- Suspicious Communication on Social Platforms: https://www.kaggle.com/datasets/syedabbasraza/suspicious-communication-on-social-platforms
The remaining comments were collected manually from Facebook. After collecting the Bangla comments, each comment was checked to confirm it was understandable. Then, step by step, each Excel file was converted into a dataframe, the columns were renamed to the desired names ('Detect' and 'Bangla Text'), and unneeded columns were dropped where necessary. The files are saved as Excel files because CSV files cannot store Bangla text appropriately.
The five XLSX files are "suspicious_content(bangla)", "suspicious_content(bangla + english)", "suspicious_content(political)", "suspicious_content(including mention)" and "suspicious_content(all)". All the Excel files have only two columns, 'Detect' and 'Bangla Text'.
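A minimal pandas sketch of the conversion step described above; the original column names before renaming (and the exact file path) are assumptions:

```python
import pandas as pd

# Hypothetical path to one of the collected Excel files described above.
path = "suspicious_content(bangla).xlsx"  # assumed location

# Read the raw Excel sheet into a dataframe.
df = pd.read_excel(path)

# Rename the columns to the two documented names ("label"/"comment" are assumed originals).
df = df.rename(columns={"label": "Detect", "comment": "Bangla Text"})

# Keep only the two documented columns, dropping anything else.
df = df[["Detect", "Bangla Text"]]

# Save back to Excel (CSV is avoided because it does not store Bangla text reliably here).
df.to_excel(path, index=False)
```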
You will be able to see the dataset creation process in this link: https://www.kaggle.com/code/meherunnesashraboni/suspicious
This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.
Dataset description
The data are provided as a single Parquet file, with the following structure:
The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.
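A minimal pandas sketch for splitting the [participant]_[month] index described above; the Parquet file name is an assumption:

```python
import pandas as pd

# Load the feature matrix (file name is an assumption; it is not given in the description).
df = pd.read_parquet("discover_features.parquet")

# The index is "[participant]_[month]", e.g. "34_12" = month 12 of participant 34.
# Split it into separate participant and month columns for convenience.
idx = df.index.to_series().str.split("_", expand=True)
df["participant"] = idx[0].astype(int)
df["month"] = idx[1].astype(int)

print(df[["participant", "month"]].head())
```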
File contains aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.
The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.
The data subset used in this work comprises the following:
From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).
The dataset contains a total of 35,694 rows, one per participant-month of data collection. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1 and SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.
Name: GoiEner smart meters data
Summary: The dataset contains hourly time series of electricity consumption (kWh) provided by the Spanish electricity retailer GoiEner. The time series are arranged in four compressed files plus a metadata file:
- raw.tzst: raw time series of all GoiEner clients (any date, any length, may have missing samples).
- imp-pre.tzst: processed time series (imputation of missing samples), longer than one year, collected before March 1, 2020.
- imp-in.tzst: processed time series (imputation of missing samples), longer than one year, collected between March 1, 2020 and May 30, 2021.
- imp-post.tzst: processed time series (imputation of missing samples), longer than one year, collected after May 30, 2021.
- metadata.csv: relevant information for each time series.
License: CC-BY-SA
Acknowledgement: These data have been collected in the framework of the WHY project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 891943.
Disclaimer: The sole responsibility for the content of this publication lies with the authors. It does not necessarily reflect the opinion of the Executive Agency for Small and Medium-sized Enterprises (EASME) or the European Commission (EC). EASME and the EC are not responsible for any use that may be made of the information contained therein.
Collection date: From November 2, 2014 to June 8, 2022.
Publication date: December 1, 2022.
DOI: 10.5281/zenodo.7362094
Other repositories: None.
Author: GoiEner, University of Deusto.
Objective of collection: This dataset was originally used to establish a methodology for clustering households according to their electricity consumption.
Description: The meaning of each column is described next for each file.
- raw.tzst (no column names provided): timestamp; electricity consumption in kWh.
- imp-pre.tzst, imp-in.tzst, imp-post.tzst: "timestamp": timestamp; "kWh": electricity consumption in kWh; "imputed": binary value indicating whether the row has been obtained by imputation.
- metadata.csv: "user": 64-character hash identifying a user; "start_date": initial timestamp of the time series; "end_date": final timestamp of the time series; "length_days": number of days elapsed between the initial and final timestamps; "length_years": number of years elapsed between the initial and final timestamps; "potential_samples": number of samples that should lie between the initial and final timestamps of the time series if there were no missing values; "actual_samples": number of actual samples in the time series; "missing_samples_abs": potential samples minus actual samples; "missing_samples_pct": potential samples minus actual samples, as a percentage; "contract_start_date": contract start date; "contract_end_date": contract end date; "contracted_tariff": type of tariff contracted (2.X: households and SMEs; 3.X: SMEs with high consumption; 6.X: industries, large commercial areas, and farms); "self_consumption_type": the type of self-consumption to which the user is subscribed; "p1", "p2", "p3", "p4", "p5", "p6": contracted power (in kW) for each of the six time slots; "province": province where the user is located; "municipality": municipality where the user is located (municipalities below 50,000 inhabitants have been removed); "zip_code": post code (post codes of municipalities below 50,000 inhabitants have been removed); "cnae": CNAE (Clasificación Nacional de Actividades Económicas) code for economic activity classification.
5-star rating: ⭐⭐⭐
Preprocessing steps: Data cleaning (imputation of missing values using the Last Observation Carried Forward algorithm with weekly seasonality); data integration (combination of multiple SIMEL files, i.e. the data sources); data transformation (anonymization, unit conversion, metadata generation).
Reuse: This dataset is related to the datasets "A database of features extracted from different electricity load profiles datasets" (DOI 10.5281/zenodo.7382818), where time series feature extraction has been performed, and "Measuring the flexibility achieved by a change of tariff" (DOI 10.5281/zenodo.7382924), where the metadata has been extended to include the results of a socio-economic characterization and the answers to a survey about barriers to adapting to a change of tariff.
Update policy: There might be a single update in mid-2023.
Ethics and legal aspects: The data provided by GoiEner contained values of the CUPS (Meter Point Administration Number), which are personal data. A pre-processing step has been carried out to replace the CUPS with random 64-character hashes.
Technical aspects: raw.tzst contains a 15.1 GB folder with 25,559 CSV files; imp-pre.tzst contains a 6.28 GB folder with 12,149 CSV files; imp-in.tzst contains a 4.36 GB folder with 15,562 CSV files; and imp-post.tzst contains a 4.01 GB folder with 17,519 CSV files.
Other: None.
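A minimal pandas sketch for working with the extracted files; the delimiter, the file name placeholder and the hourly-frequency assumption should be checked against the actual data:

```python
import pandas as pd

# Assumes raw.tzst has already been extracted (e.g. with `tar --zstd -xf raw.tzst`)
# and that each CSV holds one user's hourly series with no header row, as described above.
series = pd.read_csv(
    "raw/<one-of-the-64-character-hash-files>.csv",  # placeholder path, not a real file name
    header=None,
    names=["timestamp", "kWh"],
    parse_dates=["timestamp"],
)

# Metadata for all users.
meta = pd.read_csv("metadata.csv")

# Fraction of missing hourly samples for this user, mirroring missing_samples_pct.
expected = pd.date_range(series["timestamp"].min(), series["timestamp"].max(), freq="h")
missing_pct = 100 * (len(expected) - len(series)) / len(expected)
print(f"missing samples: {missing_pct:.1f}%")
```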
Summary
The City and County of San Francisco contracts with hundreds of nonprofit organizations to provide services for San Franciscans. These services include healthcare, legal aid, shelter, children’s programming, and more. This dataset contains all payments issued to nonprofit organizations by City departments since FY2019 and will be updated at the close of each fiscal year. The underlying data is pulled from Supplier Payments on SF OpenBook; please use SF OpenBook to find current-year data. The data in this dataset are presented in easy-to-read dashboards on our website. View the dashboards here: https://www.sf.gov/data/san-francisco-nonprofit-contracts-and-spending.
How the dataset is created
The Controller’s Office performs several significant data cleaning steps before uploading this dataset to the SF Open Data Portal. Please read the cleaning steps below.
Cleaning Steps
1. SF OpenBook provides a filter labeled “Non-Profits Only” (Yes, No), and datasets exported from SF OpenBook include a “Non Profit” column indicating whether the supplier is a nonprofit (Yes, Blank). However, this field is not always accurate and excludes about 150 known nonprofits that are not labeled as nonprofits in the City’s financial system. To ensure a complete dataset, we exported a full list of supplier payment data from SF OpenBook with the “Non-Profits Only” field filtered to “No”, which provides all supplier payments regardless of nonprofit status. We cleaned this data by adding a new “Nonprofit” column and used it to note a nonprofit status of “Yes” for approximately 150 known nonprofit suppliers without this indicator flagged in the financial system, in addition to any nonprofits already accurately flagged in the system. We then filtered the full dataset using the new nonprofit column and used the filtered data for all of the dashboards on the webpage linked above. The list of excluded nonprofits may change over time as information gets updated in the City’s data system. Download the cleaned and updated dataset on the City’s Open Data Portal, which includes all of the known nonprofits.
2. While the University of California, San Francisco (UCSF) is technically not-for-profit, a university’s financial management is very different from that of traditional nonprofit service providers, and the City’s agreement with UCSF includes hospital staffing in addition to contracted services to the public. As such, the Controller’s Office uses the nonprofit column to exclude payments to UCSF when reporting on overall spending. There are divisions of UCSF that provide more traditional contracted services, but these cannot be clearly identified in the data. Note that filtering out this data may result in an underrepresentation of overall spending.
3. The Controller’s Office also excludes several specific contracts that are predominantly “pass through” payments, where the nonprofit provider receives funds that it disburses to other agencies, such as for childcare or workforce subsidies. These types of contracts are substantially different from contracts where the nonprofit is providing direct services to San Franciscans.
Update process
This dataset will be manually updated after year-end financial processing is complete, typically in September. There may be a delay between the end of the fiscal year and the publication of this dataset.
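A minimal pandas sketch of cleaning step 1; the “Non Profit” column name comes from the description, while the file names and the “Supplier Name” column are assumptions:

```python
import pandas as pd

# Full SF OpenBook export ("Non-Profits Only" = No) plus a hand-maintained list of
# known nonprofits missing the flag. File and supplier-name column are assumptions.
payments = pd.read_csv("supplier_payments_export.csv")
known_nonprofits = set(pd.read_csv("known_nonprofits.csv")["Supplier Name"])

# New "Nonprofit" column: keep suppliers already flagged, plus the known additions.
payments["Nonprofit"] = (
    payments["Non Profit"].eq("Yes") | payments["Supplier Name"].isin(known_nonprofits)
).map({True: "Yes", False: ""})

# Filtered view used for reporting.
nonprofit_payments = payments[payments["Nonprofit"] == "Yes"]
```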
https://creativecommons.org/publicdomain/zero/1.0/
By US Open Data Portal, data.gov [source]
This Kaggle dataset showcases the groundbreaking research undertaken by the GRACEnet program, which is attempting to better understand and minimize greenhouse gas (GHG) emissions from agro-ecosystems in order to create a healthier world for all. Through multi-location field studies that utilize standardized protocols – combined with models, producers, and policy makers – GRACEnet seeks to: typify existing production practices, maximize C sequestration, minimize net GHG emissions, and meet sustainable production goals. This Kaggle dataset allows us to evaluate the impact of different management systems on factors such as carbon dioxide and nitrous oxide emissions, C sequestration levels, crop/forest yield levels – plus additional environmental effects like air quality etc. With this data we can start getting an idea of the ways that agricultural policies may be influencing our planet's ever-evolving climate dilemma
Step 1: Familiarize yourself with the columns in this dataset. In particular, pay attention to Spreadsheet tab description (a brief description of each spreadsheet tab), Element or value display name (the name of each element or value being measured), Description (a detailed description), Data type (the type of data being measured), Unit (the unit of measurement for the data), Calculation (the calculation used to determine a value or percentage), Format (the format required for submitting values), and Low Value and High Value (the range of acceptable entries).
Step 2: Familiarize yourself with any additional information related to calculations. Most calculations use accepted best estimates based on standard protocols defined by GRACEnet. Every calculation is described in detail, including post-processing steps such as quality assurance/quality control changes and measurement uncertainty assessment where available sources permit. Relevant calculations were discussed collaboratively between all participating partners at every level where they felt it necessary, and all terms were rigorously reviewed before the partners agreed on any decision. A range was established when several assumptions were needed, or when there was a high possibility that samples might fall outside the ranges associated with the standard protocol conditions set up at the GRACEnet Headquarters laboratories because of external factors such as soil type or climate.
Step 3: Determine what types of operations are allowed within each spreadsheet tab (.csv file). For example, on some tabs adding an entire row may be permitted, but using formulas is not, since non-standard manipulations often introduce errors into an analysis; users are therefore encouraged to add new rows or columns only where it suits their specific analysis. Operations such as filling blank cells with zeros, or deleting rows or columns made redundant by the standard filtering already applied to other tabs, should be avoided, because such non-standard changes add unverified noise that can bias results during later robustness testing and self-verification.
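A minimal pandas sketch of a range check based on the data-dictionary columns described in Step 1 (Element or value display name, Low Value, High Value); the file names and the layout of the measurement table are assumptions:

```python
import pandas as pd

# Data dictionary: one row per element, with its acceptable Low Value / High Value.
dictionary = pd.read_csv("data_dictionary.csv")          # assumed file name
# Measurements in long format: element name plus a "Value" column (assumed layout).
measurements = pd.read_csv("measurements.csv")

limits = dictionary.set_index("Element or value display name")[["Low Value", "High Value"]]

def out_of_range(row):
    """Flag values outside the acceptable range given in the dictionary."""
    low, high = limits.loc[row["Element or value display name"]]
    return not (low <= row["Value"] <= high)

flagged = measurements[measurements.apply(out_of_range, axis=1)]
print(f"{len(flagged)} measurements fall outside the documented ranges")
```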
- Analyzing and comparing the environmental benefits of different agricultural management practices, such as crop yields and carbon sequestration rates.
- Developing an app or other mobile platform to help farmers find management practices that maximize carbon sequestration and minimize GHG emissions in their area, based on their specific soil condition and climate data.
- Building an AI-driven model to predict net greenhouse gas emissions and C sequestration from potential weekly/monthly production plans across different regions in the world, based on optimal allocation of resources such as fertilizers, equipment, water etc
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the ...
https://creativecommons.org/publicdomain/zero/1.0/
By data.gov.ie [source]
This dataset contains data from the East Atlantic SWAN Wave Model, which is a powerful model developed to predict wave parameters in Irish waters. The output features of the model include Significant Wave Height (m), Mean Wave Direction (degreesTrue) and Mean Wave Period (seconds). These predictions are generated with NCEP GFS wind forcing and FNMOC Wave Watch 3 data as boundaries for the wave generation.
The accuracy of this model is important for safety critical applications as well as research efforts into understanding changes in tides, currents, and sea levels, so users are provided with up-to-date predictions for the previous 30 days and 6 days into the future with download service options that allow selection by date/time, one parameter only and output file type.
Data providers released this dataset under a Creative Commons Attribution 4.0 license on 2017-09-14. It can be used free of charge, subject to the restrictions set out by its respective author or publisher.
Introduction:
Step 1: Acquire the Dataset:
The first step is getting access to the dataset, which is free of charge. The original source of this data is http://wwave2.marinecstl.org/archive/index?cat=model_height&xsl=download-csv-1. You can also get this data by downloading it as a CSV file from Kaggle’s website (https://www.kaggle.com/marinecstl/east-atlantic-swan-wave-model). The download should contain seven columns of parameters; time, latitude, longitude, and significant wave height are the most important ones to be familiar with before using this data set effectively in any project.
Step 2: Understand Data Columns & Parameters:
Now that you have downloaded the data, it is time to understand what each column represents and how the columns relate to each other when comparing datasets from two different locations within one country or across two countries. Time gives the daily timestamp of each observation, taken at the exact location specified by the latitude and longitude parameters; latitude ranges roughly between -90° and +90°, where higher values indicate locations closer to the North Pole and lower values locations closer to the South Pole. Significant wave height, on the other hand, represents the displacement of the ocean surface due to measurable short-period variations caused by tides or waves, i.e. by weather differences such as wind forcing or, in more extreme conditions, oceanic storms.
Step 3: Understanding Data Limitations & Applying Exclusion Criteria:
Keep in mind that, because the model runs every day across various geographical regions, some inaccuracy in the predicted values for any given time slot is inevitable. It is therefore essential that users apply appropriate criteria during the analysis phase, taking into consideration natural limitations such as current weather conditions and water depth when compiling buoyancy-related readings for particular timestamps, whether the information is obtained via the CSV file or via API services. Also remember that these predictions must not be relied upon for safety-critical purposes.
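A minimal pandas sketch for loading the export; the file name comes from the dataset listing (download-csv-1.csv), while the exact column labels are assumptions and should be checked with df.columns:

```python
import pandas as pd

# Load the SWAN wave model export; inspect the real column names first.
df = pd.read_csv("download-csv-1.csv", parse_dates=["time"])
print(df.columns.tolist())

# Daily mean significant wave height at one grid point (column names are assumptions).
point = df[(df["latitude"] == 53.0) & (df["longitude"] == -11.0)]
daily = point.set_index("time")["significant_wave_height"].resample("D").mean()
print(daily.head())
```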
- Visualizing wave heights in the East Atlantic area over time to map oceanic currents.
- Finding areas of high-wave activity: using this data, researchers can identify unique areas that experience particularly severe waves, which could be essential to know for protecting maritime vessels and informing navigation strategies.
- Predicting future wave behavior: by analyzing current and past trends in SWAN Wave Model data, scientists can predict how significant wave heights will change over future timescales in the studied area
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: download-csv-1.csv | Column name | Descrip...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Step back from the baggage claim : change the world, start at the airport. It features 4 columns, including author, book publisher, and BNB id.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General overview
The following datasets are described by this metadata record, and are available for download from the provided URL.
- Raw log files, physical parameters raw log files
- Raw excel files, respiration/PAM chamber raw excel spreadsheets
- Processed and cleaned excel files, respiration chamber biomass data
- Raw rapid light curve excel files (this is duplicated from Raw log files), combined dataset pH, temperature, oxygen, salinity, velocity for experiment
- Associated R script file for pump cycles of respiration chambers
####
Physical parameters raw log files
Raw log files
1) DATE=
2) Time= UTC+11
3) PROG=Automated program to control sensors and collect data
4) BAT=Amount of battery remaining
5) STEP=check aquation manual
6) SPIES=check aquation manual
7) PAR=Photoactive radiation
8) Levels=check aquation manual
9) Pumps= program for pumps
10) WQM=check aquation manual
####
Respiration/PAM chamber raw excel spreadsheets
Abbreviations in headers of datasets
Note: Two data sets are provided in different formats, raw and cleaned (adj). These are the same data, with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) dataframe will work with the R syntax below; alternatively, add code to do the cleaning in R.
Date: ISO 1986 - Check
Time:UTC+11 unless otherwise stated
DATETIME: UTC+11 unless otherwise stated
ID (of instrument in respiration chambers)
ID43=Pulse amplitude fluorescence measurement of control
ID44=Pulse amplitude fluorescence measurement of acidified chamber
ID=1 Dissolved oxygen
ID=2 Dissolved oxygen
ID3= PAR
ID4= PAR
PAR=Photo active radiation umols
F0=minimal florescence from PAM
Fm=Maximum fluorescence from PAM
Yield=(Fm – F0)/Fm
rChl=an estimate of chlorophyll (Note this is uncalibrated and is an estimate only)
Temp=Temperature degrees C
PAR=Photo active radiation
PAR2= Photo active radiation2
DO=Dissolved oxygen
%Sat= Saturation of dissolved oxygen
Notes=This is the program of the underwater submersible logger, with the following abbreviations:
Notes-1) PAM=
Notes-2) PAM=Gain level set (see aquation manual for more detail)
Notes-3) Acclimatisation= Program of slowly introducing treatment water into chamber
Notes-4) Shutter start up 2 sensors+sample…= Shutter PAMs automatic set up procedure (see aquation manual)
Notes-5) Yield step 2=PAM yield measurement and calculation of control
Notes-6) Yield step 5= PAM yield measurement and calculation of acidified
Notes-7) Abatus respiration DO and PAR step 1= Program to measure dissolved oxygen and PAR (see aquation manual). Steps 1-4 are different stages of this program including pump cycles, DO and PAR measurements.
8) Rapid light curve data
Pre LC: A yield measurement prior to the following measurement
After 10.0 sec at 0.5% to 8%: Level of each of the 8 steps of the rapid light curve
Odessey PAR (only in some deployments): An extra measure of PAR (umols) using an Odessey data logger
Dataflow PAR: An extra measure of PAR (umols) using a Dataflow sensor.
PAM PAR: This is copied from the PAR or PAR2 column
PAR all: This is the complete PAR file and should be used
Deployment: Identifying which deployment the data came from
####
Respiration chamber biomass data
The data is chlorophyll a biomass from cores taken from the respiration chambers. The headers are: Depth (mm); Treat (acidified or control); Chl a (pigment and indicator of biomass); Core (five cores were collected from each chamber, three of which were analysed for chl a). These are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.
####
Associated R script file for pump cycles of respiration chambers
Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to chambers. Determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the respiration chambers are delivering water as this will give incorrect results. The headers that get used in the attached/associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.
To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180 minute data blocks were determined. R squared values for fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions with calculation of net production rates: the first is that heterotrophic community members do not change their metabolism under OA; and the second is that the heterotrophic communities are similar between treatments.
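A minimal Python sketch of the block-wise regression described above; the published workflow uses the associated R script, so this is only an illustration, and the file and column names ("ElapsedTimeMin", "DO") are assumptions:

```python
import numpy as np
import pandas as pd

# Fit a linear slope of dissolved oxygen against elapsed time within 180-minute blocks.
df = pd.read_csv("chamber_timeseries.csv")            # assumed file
df["block"] = (df["ElapsedTimeMin"] // 180).astype(int)

def fit_block(g):
    # Slope and intercept of DO vs time; the slope approximates the net production rate.
    slope, intercept = np.polyfit(g["ElapsedTimeMin"], g["DO"], 1)
    r2 = np.corrcoef(g["ElapsedTimeMin"], g["DO"])[0, 1] ** 2
    return pd.Series({"slope": slope, "intercept": intercept, "r_squared": r2})

rates = df.groupby("block").apply(fit_block)
print(rates[rates["r_squared"] > 0.9])                # keep well-fitted blocks, as in the description
```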
####
Combined dataset pH, temperature, oxygen, salinity, velocity for experiment
This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).
The headers are
PAR: Photoactive radiation
relETR: F0/Fm x PAR
Notes: Stage/step of light curve
Treatment: Acidified or control
The associated light treatments in each stage. Each actinic light intensity is held for 10 seconds, then a saturating pulse is taken (see PAM methods).
After 10.0 sec at 0.5% = 1 umols PAR
After 10.0 sec at 0.7% = 1 umols PAR
After 10.0 sec at 1.1% = 0.96 umols PAR
After 10.0 sec at 1.6% = 4.32 umols PAR
After 10.0 sec at 2.4% = 4.32 umols PAR
After 10.0 sec at 3.6% = 8.31 umols PAR
After 10.0 sec at 5.3% =15.78 umols PAR
After 10.0 sec at 8.0% = 25.75 umols PAR
This dataset appears to be missing data; note that the D5 rows may not contain usable information.
See the word document in the download file for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PatagoniaMet v1.0 (PMET from here on) is a new dataset for Western Patagonia which comprises two datasets: i) PMET-obs, a compilation of quality-controlled ground-based hydrometeorological data, and ii) PMET-sim, a daily gridded product of precipitation and maximum and minimum temperature. PMET-obs was developed using a 4-step quality control process applied to 523 hydro-meteorological time series (precipitation, air temperature, potential evaporation, streamflow and lake level stations) obtained from eight institutions in Chile and Argentina. Based on this dataset and currently available uncorrected gridded products (ERA5), PMET-sim was developed using statistical bias correction procedures (i.e., quantile mapping), spatial regression models (random forest) and hydrological methods (Budyko framework). The details of each dataset are the following:
- PMET-obs is a compilation of five hydrometeorological variables obtained from eight institutions in Chile and Argentina. The daily quality-controlled data for each variable are stored in separate .csv files with the following naming convention: variable_PMETobs_1950_2020_v10d.csv. Each column represents a different gauge with its "gauge_id". Each variable has an additional .csv file with the metadata of each station (variable_PMETobs_v10_metadata.csv). For all variables, the metadata includes at least the name (gauge_name), the institution, the station location (gauge_lat and gauge_lon), the altitude (gauge_alt) and the total number of daily records (length). Following current guidelines for hydrological datasets, the upstream area corresponding to each stream gauge was delimited, and several climatic and geographic attributes were derived. The details of the attributes can be found in the README file.
- PMET-sim is a daily gridded product with a spatial resolution of 0.05° covering the period 1980-2020. The data for each variable (precipitation and maximum and minimum temperature) are stored in separate netcdf files with the following naming convention: variable_PMETsim_1980_2020_v10d.nc.
Citation: Aguayo, R., León-Muñoz, J., Aguayo, M., Baez-Villanueva, O., Fernandez, A., Zambrano-Bigiarini, M., and Jacques-Coper, M. (2023) PatagoniaMet v1.0: A multi-source hydrometeorological dataset for Western Patagonia (40-56ºS). Submitted to Scientific Data.
Code repository: https://github.com/rodaguayo/PatagoniaMet
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List: Model_Sim_data.txt
Description: The file Model_Sim_data.txt is a tab-delimited text file containing the data used to simulate dose-response, broken-stick, step-function, and linear response models in the evaluation of TITAN-derived change points.
Column definitions:
STAID: station identifier
Urb: urban intensity
BS: broken stick model with threshold at 0.5
Lin: linear model with no threshold
STP: step-function model with threshold at 0.5
DR: dose-response model with thresholds at 0.35 and 0.65
Checksums:
-- TABLE: Please see in attached file. --
The US Census Bureau conducts the American Community Survey (ACS) 1-Year and 5-Year surveys, which record various demographics and provide public access through APIs. I have attempted to call the APIs through the Python environment using the requests library, then clean and organize the data into a usable format.
ACS Subject data [2011-2019] was accessed using Python by following the below API Link:
https://api.census.gov/data/2011/acs/acs1?get=group(B08301)&for=county:*
The data was obtained in JSON format by calling the above API, then imported as a Python Pandas Dataframe. The 84 variables returned comprise 21 Estimate values for various metrics, 21 corresponding Margin of Error values, and the respective Annotation values for both the Estimates and the Margins of Error. This data then underwent various cleaning processes in Python, where excess variables were removed and the columns were renamed. Web scraping was carried out to extract the variables' names and replace the codes in the column names of the raw data.
The above step was carried out for multiple ACS/ACS-1 datasets spanning 2011-2019, which were then merged into a single Python Pandas Dataframe. The columns were rearranged, and the "NAME" column was split into two columns, 'StateName' and 'CountyName.' Counties for which no data was available were removed from the Dataframe. Once the Dataframe was ready, it was split into two new dataframes, separating state and county data, and exported in '.csv' format.
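A minimal sketch of the access pattern described above, using the documented endpoint; the exact set of columns returned by group(B08301) may vary by year and survey:

```python
import requests
import pandas as pd

# Fetch one ACS table (B08301) for all counties, following the endpoint quoted above.
url = "https://api.census.gov/data/2011/acs/acs1"
resp = requests.get(url, params={"get": "group(B08301)", "for": "county:*"})
resp.raise_for_status()

# The API returns JSON as a list of rows; the first row holds the column names.
rows = resp.json()
df = pd.DataFrame(rows[1:], columns=rows[0])

# Split "NAME" (e.g. "Alameda County, California") into county and state columns.
df[["CountyName", "StateName"]] = df["NAME"].str.split(", ", n=1, expand=True)
print(df.head())
```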
More information about the source of Data can be found at the URL below:
US Census Bureau. (n.d.). About: Census Bureau API. Retrieved from Census.gov
https://www.census.gov/data/developers/about.html
I hope this data helps you to create something beautiful, and awesome. I will be posting a lot more databases shortly, if I get more time from assignments, submissions, and Semester Projects 🧙🏼♂️. Good Luck.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the model output of the Bayesian Poisson High Dimensional Fixed Effects (BHPDFE) structural gravity model for the deliverable of CLEVER WP3/T3.2: Estimation of trade stickiness and trade substitution effects for selected products. The information in the csv-file can be used in CLEVER WP6/T.6.2 & T.6.5 to map empirical estimates of trade elasticity in the soybean sector in GLOBIOM-equivalent terms. The columns of WP3_T3.2_BRA_EU_trade_sensitivity_equivalents_estimates.csv are structured as follows:
- item: GLOBIOM product type
- exporter: exporting GLOBIOM region
- importer: importing GLOBIOM region
- GLOBIOM_timestep: time step of GLOBIOM output
- GLOBIOM_relative_change: GLOBIOM relative percentage change to the baseline quantities in trade of item between exporter and importer (used to match BHPDFE equivalents)
- GLOBIOM_trade_cost_parameter: name of the GLOBIOM trade cost parameter in the sensitivity analysis
- BHPDFE_gravity_CF_scenario: name of the counterfactual scenario of the BHPDFE gravity model analysis; in parentheses, the source of the effect
- BHPDFE_gravity_CF_time_frame: time frame of the BHPDFE gravity model counterfactual analysis
- GLOBIOM_shifter_value: shifter value of GLOBIOM_trade_cost_parameter (corresponding to BPHPDFE_gravity_shifter)
- BPHPDFE_gravity_shifter: shifter value used in the counterfactual estimation of the BPHDFE_gravity_shifter (corresponding to GLOBIOM_shifter_value)
- BPHPDFE_gravity_estimate: key underlying parameter value of the source described in BHPDFE_gravity_CF_scenario
This version covers: item(s): Soya; exporter(s): Brazil; importer(s): EU.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aging dataset of a commercial 22Ah LCO-graphite pouch Li-Po battery.
The cycling procedure involves aging steps consisting of 22 aging cycles at 1C CC discharge and C/2 CC-CV charge, with no pauses in between. Periodic RPTs are carried out after each aging step. In particular, two series of RPTs are alternated, referred to as RPT-A and RPT-B, with this pattern: 22 aging cycles -> RPT-A -> 22 aging cycles -> RPT-A + RPT-B -> repeat.
The RPT-A consists of three high rate cycles (1C CC discharge and C/2 CC-CV charge) with 1 hour rest. The RPT-B consists of three high rate cycles (1C CC discharge and C/2 CC-CV charge) with 1 hour rest, one low rate cycle (C/20) and the HPPC test. In this way, high rate test cycles are carried out periodically every 25 cycles (22 aging + 3 test), whereas low rate test cycles and HPPC are carried out every 50 cycles. The exact cycle number at which each reference performance test was carried out is reported in the sixth column of the data structure.
In total, 1,125 cycles were completed before the SOH reached 70%.
The cycling reference performance tests (high rate cycling 1C-C/2, and low rate cycling C/20-C/20) are reported in the MATLAB structure called Aging_Dataset_Cycling. On the other hand, the data of the HPPC tests are reported in the MATLAB structure called Aging_Dataset_HPPC.
The data structure of the cycling reference performance tests is a MATLAB cell organized so that the first row holds the data of RPT-A (high rate cycles) and the second row the data of RPT-B (low rate cycles). The first column contains discharge data, the second column charge data, the third column the data recorded in the one-hour rest after discharge, and the fourth column the data recorded in the one-hour rest after charge. Each element of this 2x4 matrix is a cell containing the structures referring to the individual reference performance tests: the rows correspond to reference performance tests carried out at different aging cycles (detailed in the vector in the sixth column of the main data structure), and the columns to tests repeated at the same aging cycle for statistical studies. Generally, RPT-A tests are repeated three times and RPT-B tests are repeated once. Each cell, e.g. D{1,1}{1,1}, contains a structure with the data of that test, coded as explained in the bullet list below.
The data recorded during the reference performance test, reported in the data structure, were:
Time test [s]. Variable name: Time.
Battery temperature [°C]. Variable name: T_batt.
Ambient temperature [°C]. Variable name: T_amb.
Battery voltage [V]. Variable name: V_batt.
Charging current [A]. Variable name: I_PS
Discharging current [A]. Variable name: I_EL
Laser sensor 1 reading [V]. Variable name: Las1
Laser sensor 2 reading [V]. Variable name: Las2
Battery deformation [mm], meant as the thickness change of the battery. Variable name: Dthk
Deformation measurements were carried out measuring the out-of-plane displacement of the two largest surfaces of the battery with a couple of laser sensors, as explained in these Figures. The two sensor readings are expressed in Volt, ranging from 0V (start measuring distance) to 10V (end measuring distance), and are proportional to the distance between the laser (fixed) and the battery surface (moving because of the thickness change). The reversible deformation within a single cycle is already computed in the variable Battery deformation and it is expressed in millimeter. The reversible deformation is computed as the sum of the two laser readings (1V = 1mm), net of the sum of the two initial laser readings. The single laser readings are useful to compute the irreversible deformation, namely how the thickness of the battery changes during aging. This is possible because the laser remained fixed during the whole aging test, and the reference was not lost. Therefore, to calculate the deformation of the battery at any given moment during the aging test, it is necessary to sum the two laser readings at the given moment and subtract the sum of the two initial laser readings.
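A minimal numerical sketch of the deformation calculation described above (1 V of laser reading corresponds to 1 mm); the readings shown are illustrative only:

```python
import numpy as np

# las1, las2 are the two laser sensor readings (variables Las1 and Las2) over one test.
las1 = np.array([4.10, 4.12, 4.15, 4.13])   # example values in volts (illustrative only)
las2 = np.array([3.90, 3.93, 3.97, 3.94])

total = las1 + las2                          # combined reading, proportional to thickness
deformation_mm = total - total[0]            # thickness change relative to the initial reading

# For irreversible (aging) deformation, subtract the sum of the readings taken at the
# very start of the aging test instead of the start of the current cycle.
print(deformation_mm)
```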
Example of the data structure: D{1,1} contains all the discharge data of all the RPT-A tests. In total, there are 47 rows and 4 columns, because RPT-A tests were conducted at 47 different aging levels (the respective number of cycles is reported in the vector stored in the sixth column, first row of the main data structure), and the tests are repeated up to 4 times at the same aging level, although most of the time they were repeated just three times. Then, D{1,1}{1,1} contains the discharge data of the first reference performance (RPT-A) test carried out at the first aging level (10 cycles), D{1,1}{1,2} contains the discharge data of the second reference performance (RPT-A) test carried out at the first aging level, D{1,1}{2,1} contains the discharge data of the first reference performance (RPT-A) test carried out at the second aging level (20 cycles), and so on. D{1,2} contains all the charge data of all the RPT-A tests, and D{2,1} and D{2,2} contain all the discharge and charge data of the RPT-B (low rate C/20) tests. The substructures work as described for D{1,1}.
The data structure of the HPPC reference performance tests is a MATLAB cell organized so that the rows correspond to different aging cycles, and the first ten columns correspond to the SOC at which the HPPC test is carried out, going from 100% to 10%. The 11th column contains the number of aging cycles at which the tests in that row were carried out. Each structure in this matrix refers to a single HPPC test and contains the following data:
Time test [s]. Variable name: Time.
Battery voltage [V]. Variable name: V_batt.
Charging current [A]. Variable name: I_PS
Discharging current [A]. Variable name: I_EL
Ambient temperature was controlled with a climatic chamber and it was kept constant at 20°C during all the tests.
You have five '.xls' files named savedrecs. The files contain articles related to chemistry, with a focus on ML and AI topics. Besides these, you have two extra files for your interpretation. One aim of this dataset is to teach different methodologies for various kinds of data within a single dataset; another is learning to deal with novel data, so this dataset represents a progression in your career steps. Below are the steps you should be able to carry out on the provided files:
1. Apply the appropriate concatenation method for joining the given files (see the sketch after this list).
2. Transform the categorical data into numerical form with a suitable strategy.
3. Decide which features are significant for the aim of the described scenario.
4. Select the required features of the dataset.
5. Investigate the correct strategy for filling NaN values in the dataset.
6. Demonstrate an understandable visualization for the time series.
7. Develop a new column from the existing columns according to the purpose of the scenario.
8. Interpret and appraise the dataset.
9. Apply the methodology for handling the textual data.
10. Convert the textual data to numerical form.
11. Present what you did throughout your study.
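A minimal pandas sketch for step 1; the savedrecs*.xls file name pattern is an assumption:

```python
import glob
import pandas as pd

# Read the five savedrecs exports and concatenate them into one table.
# Legacy .xls files require the xlrd engine to be installed.
frames = [pd.read_excel(path) for path in sorted(glob.glob("savedrecs*.xls"))]
articles = pd.concat(frames, ignore_index=True)
print(articles.shape)
```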
The primary business task is to analyze how casual riders and annual members use Cyclistic's bike-share services differently. The insights gained from this analysis will help the marketing team develop strategies aimed at converting casual riders into annual members. This analysis needs to be supported by data and visualizations to convince the Cyclistic executive team.
Casual Riders vs. Annual Members: The core focus of the case study is on the behavioral differences between casual riders and annual members. Cyclistic Historical Trip Data: The data being used is Cyclistic's bike-share trip data, which includes variables like trip duration, start and end stations, user type (casual or member), and bike IDs. Goal: The goal is to design a marketing strategy that targets casual riders and converts them into annual members, as annual members are more profitable for the company.
Lily Moreno: Director of marketing, responsible for Cyclistic’s marketing strategy. Cyclistic Marketing Analytics Team: The team analyzing and reporting on the data. Cyclistic Executive Team: The decision-makers who need to be convinced by the analysis to approve the proposed marketing strategy.
Note: for Q2, the raw file uses incorrect (non-standard) column names, listed below.
- 01 - Rental Details Rental ID: identifier for each bike rental.
- 01 - Rental Details Local Start Time: The local date and time when the rental started, recorded in MM/DD/YYYY HH:MM format.
- 01 - Rental Details Local End Time: The local date and time when the rental ended, recorded in MM/DD/YYYY HH:MM format.
- 01 - Rental Details Bike ID: identifier for the bike used during the rental.
- 01 - Rental Details Duration In Seconds Uncapped: The total duration of the rental in seconds, including trips that exceed standard time limits (uncapped).
- 03 - Rental Start Station ID: identifier for the station where the rental began.
- 03 - Rental Start Station Name: The name of the station where the rental began.
- 02 - Rental End Station ID: identifier for the station where the rental ended.
- 02 - Rental End Station Name: The name of the station where the rental ended.
- User Type: Specifies whether the user is a "Subscriber" (annual member) or a "Customer" (casual rider).
- Member Gender: The gender of the member (if available).
- 05 - Member Details Member Birthyear: The birth year of the member (if available).
Cleaning and processing steps:
- Added a ride_length column using ride_length = D2 - C2 to reflect each trip’s duration.
- Added a day_of_week column using the formula =TEXT(C2,"dddd") to extract the weekday from the start time.
- Removed the gender and birthyear columns due to excessive missing values.
- Standardized start and end times to MM/DD/YYYY HH:MM and ensured uniform number formatting for trip IDs.
- Checked the member_casual column to ensure correct identification of casual riders and members.
- Combined the quarterly files with a UNION ALL query.
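A minimal pandas sketch of the same derivations outside the spreadsheet, using the Q2 column names listed above; the file name is an assumption:

```python
import pandas as pd

q2 = pd.read_csv("Divvy_Trips_2019_Q2.csv")   # file name is an assumption

start = pd.to_datetime(q2["01 - Rental Details Local Start Time"], format="%m/%d/%Y %H:%M")
end = pd.to_datetime(q2["01 - Rental Details Local End Time"], format="%m/%d/%Y %H:%M")

q2["ride_length"] = end - start               # trip duration (equivalent of D2 - C2)
q2["day_of_week"] = start.dt.day_name()       # equivalent of =TEXT(C2,"dddd")

print(q2[["ride_length", "day_of_week"]].head())
```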
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
The Great Lakes Basin Integrated Nutrient Dataset compiles and standardizes phosphorus, nitrogen, and suspended solids data collected between the 2000-2019 water years from multiple Canadian and American sources around the Great Lakes. Ultimately, the goal is to enable regional nutrient data analysis within the Great Lakes Basin. This data is not directly used in the Water Quality Monitoring and Surveillance Division tributary load calculations. Data processing steps include standardizing data column and nutrient names, date-time conversion to Universal Time Coordinates, normalizing concentration units to milligram per liter, and reporting all phosphorus and nitrogen compounds 'as phosphorus' or 'as nitrogen'. Data sources include the Environment and Climate Change Canada National Long-term Water Quality Monitoring Data (WQMS), the Provincial (Stream) Water Quality Monitoring Network (PWQMN) of the Ontario Ministry of the Environment, the Grand River Conservation Authority (GRCA) water quality data, and Heidelberg University’s National Center for Water Quality Research (NCWQR) Tributary Loading Program.
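A minimal pandas sketch of the standardization steps described above (UTC conversion and unit normalization); the file name, column names and source time zone are assumptions:

```python
import pandas as pd

df = pd.read_csv("pwqmn_raw.csv")   # one source file (name is an assumption)

# Convert local date-times to Universal Time Coordinates.
df["datetime_utc"] = (
    pd.to_datetime(df["sample_datetime"])
      .dt.tz_localize("America/Toronto")   # assumed source time zone
      .dt.tz_convert("UTC")
)

# Normalize concentrations reported in µg/L to mg/L.
is_ug = df["unit"].str.lower().eq("ug/l")
df.loc[is_ug, "value"] = df.loc[is_ug, "value"] / 1000.0
df.loc[is_ug, "unit"] = "mg/L"
```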
By Health [source]
This dataset contains mortality statistics for 122 U.S. cities in 2016, providing detailed information about all deaths that occurred due to any cause, including pneumonia and influenza. The data is voluntarily reported from cities with populations of 100,000 or more, and it includes the place of death and the week during which the death certificate was filed. Data is provided broken down by age group and includes a flag indicating the reliability of each data set to help inform analysis. Each row also provides longitude and latitude information for each reporting area in order to make further analysis easier. These comprehensive mortality statistics are invaluable resources for tracking disease trends as well as making comparisons between different areas across the country in order to identify public health risks quickly and effectively
This dataset contains mortality rates for 122 U.S. cities in 2016, including deaths by age group and cause of death. The data can be used to study various trends in mortality and contribute to the understanding of how different diseases impact different age groups across the country.
In order to use the data, firstly one has to identify which variables they would like to use from this dataset. These include: reporting area; MMWR week; All causes by age greater than 65 years; All causes by age 45-64 years; All causes by age 25-44 years; All causes by age 1-24 years; All causes less than 1 year old; Pneumonia and Influenza total fatalities; Location (1 & 2); flag indicating reliability of data.
Once you have identified the variables you are interested in, you will need to filter the dataset so that it only includes the information relevant to your analysis or research purposes. For example, if you are looking at trends between different ages, then all you need is information on those three specific age groups (greater than 65, 45-64 and 25-44 years). You can do this with a selection tool that lets you pick only certain columns from your data set, or with an Excel filter tool if your data is stored as a CSV file.
The next step is preparing your data. This is important for efficient analysis, and especially helpful when there are too many variables or columns, which can confuse the analysis: eliminate unnecessary columns, rename column labels where needed, and so on. In addition, make sure to clean up any missing values, outliers, or incorrect entries before further investigation; outliers and corrupt entries may lead to incorrect conclusions when analyzing the data. Once the cleaning steps are complete, it is safe to move on to drawing insights.
The last step involves using statistical methods, such as linear regression with multiple predictors or descriptive statistics such as the mean and median, to draw key insights from the analysis and generate actionable points.
With these steps taken care of, it becomes much easier for anyone who dives into another project involving this dataset to build on the work done in previous investigations.
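A minimal pandas sketch of the selection and cleaning steps above; the column names are approximations of the descriptions and should be checked against the actual headers in rows.csv:

```python
import pandas as pd

df = pd.read_csv("rows.csv")

# Columns of interest (names are approximations; adjust to the real headers).
age_cols = [
    "All causes, by age (years), 65 and older",
    "All causes, by age (years), 45-64",
    "All causes, by age (years), 25-44",
]
keep = ["Reporting Area", "MMWR Week"] + age_cols
subset = df[keep].copy()

# Basic cleaning: coerce counts to numbers and drop rows with missing values.
subset[age_cols] = subset[age_cols].apply(pd.to_numeric, errors="coerce")
subset = subset.dropna(subset=age_cols)
print(subset.describe())
```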
- Creating population health profiles for cities in the U.S.
- Tracking public health trends across different age groups
- Analyzing correlations between mortality and geographical locations
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors - You are free to: - Share - copy and redistribute the material in any medium or format for any purpose, even commercially. - Adapt - remix, transform, and build upon the material for any purpose, even commercially. - You must: - Give appropriate credit - Provide a link to the license, and indicate if changes were made. - ShareAlike - You must distribute your contributions under the same license as the original. - Keep intact - all notices that refer to this license, including copyright notices.
File: rows.csv | Column name | Description | |:--------------------------------------------|:-----------------------------------...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Seattle weather analysis involves understanding various meteorological variables recorded daily. Using the seattle-weather.csv file, we can explore weather patterns, seasonal changes, and predict future weather conditions in Seattle.
🔍 Dataset Overview:
📅 Date: The date of the recorded weather data.
☔ Precipitation: The amount of precipitation (in mm) recorded on that day.
🌡️ Temp_max: The maximum temperature (in degrees Celsius) recorded on that day.
🌡️ Temp_min: The minimum temperature (in degrees Celsius) recorded on that day.
💨 Wind: The wind speed (in m/s) recorded on that day.
🌦️ Weather: The type of weather (e.g., drizzle, rain).
Handle Missing Values: Ensure there are no missing values in the dataset.
Convert Data Types: Convert date columns to datetime format if necessary.
Normalize Numerical Variables: Scale features like Precipitation, Temp_max, Temp_min, and Wind if needed.
Select Relevant Features: Use techniques like correlation analysis to select features that contribute most to the analysis.
Visualize Data: Create plots to understand the distribution and trends of different weather variables.
Seasonal Analysis: Analyze how weather patterns change with seasons.
Choose Algorithms: Consider various machine learning algorithms such as Linear Regression, Decision Tree, Random Forest, and Time Series models.
Compare Performance: Train multiple models and compare their performance.
Train Models: Train the selected models on the data.
Evaluate Performance: Use metrics such as RMSE, MAE, and R² score to evaluate model performance.
Deploy the Model: Deploy the best model for predicting future weather conditions.
Ensure Robustness: Make sure the model is robust and can handle real-world data.
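A minimal scikit-learn sketch of the preparation and modeling steps above; predicting Temp_max from the other variables is an assumption, and the lowercase column names should be adjusted to match the actual headers in seattle-weather.csv:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load, clean, and add a simple seasonal feature.
df = pd.read_csv("seattle-weather.csv").dropna()
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month

X = df[["precipitation", "temp_min", "wind", "month"]]
y = df["temp_max"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R2  :", r2_score(y_test, pred))
```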
📊 Weather Pattern Analysis: Understanding the weather patterns in Seattle.
🌼 Seasonal Changes: Gaining insights into seasonal variations in weather.
🌦️ Future Predictions: Predicting future weather conditions in Seattle.
🔍 Research: Providing a solid foundation for research in meteorology and climate studies.
This dataset is an invaluable resource for anyone looking to analyze weather patterns and predict future conditions in Seattle, offering detailed insights into the city's meteorological variables.
Please upvote if you find this helpful! 👍
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
By Town of Cary [source]
The Town of Cary Crash Database contains five years worth of detailed crash data up to the current date. Each incident is mapped based on National Incident-Based Reporting System (NIBRS) criteria, providing greater accuracy and access to all available crashes in the County.
This valuable resource is constantly being updated – every day fresh data is added and older records are subject to change. The locations featured in this dataset reflect approximate points of intersection or impact. In cases where essential detail elements are missing or rendered unmappable, certain crash incidents may not appear on maps within this source.
We invite you to explore how crashes have influenced the Town of Cary over the past five years – from changes in weather conditions and traffic controls to vehicular types, contributing factors, travel zones and more! Whether it's analyzing road design elements or assessing fatality rates – come take a deeper look at what has shaped modern day policies for safe driving today!
- Understanding Data Elements – The first step in using this dataset is understanding what information is included in it. The data elements include location descriptions, road features, character traits of roads and more that are associated with each crash. Additionally, the data provides details about contributing factors, light conditions, weather conditions and more that can be used to understand why certain crashes happen in certain locations or under certain circumstances.
- Analyzing trends in crash locations to better understand where crashes are more likely to occur. For example, using machine learning techniques and geographical mapping tools to identify patterns in the data that could enable prediction of future hotspots of crashes.
- Investigating the correlations between roadway characteristics (e.g., surface, configuration and class) and accident severities in order to recommend improvements or additional preventative measures at certain intersections or road segments which may help reduce crash-related fatalities/injuries.
- Using data on various contributing factors (e.g., traffic control, weather conditions, work area) as input to a predictive model for analyzing the risk factors associated with different types of crashes, such as head-on collisions, rear-end collisions or side-swipe accidents, so that safety alerts can be issued for public awareness campaigns at specific times, days, or conditions where such incidents are known to occur more often or with greater severity than usual (e.g., near schools during school days).
If you use this dataset in your research, please credit the original authors. Data Source
License: Open Database License (ODbL) v1.0 - You are free to: - Share - copy and redistribute the material in any medium or format. - Adapt - remix, transform, and build upon the material for any purpose, even commercially. - You must: - Give appropriate credit - Provide a link to the license, and indicate if changes were made. - ShareAlike - You must distribute your contributions under the same license as the original. - Keep intact - all notices that refer to this license, including copyright notices. - No Derivatives - If you remix, transform, or build upon the material, you may not distribute the modified material. - No additional restrictions - You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
File: crash-data-3.csv | Column name | Description | |:--------------|:-----------------------------------------------------------------------------------------------------| | type | The type of crash, such as single-vehicle, multi-vehicle, or pedestrian. (String) | | features | The features of the crash, such as location, contributing factors, vehicle types, and more. (String) |
File: crash-data-1.csv | Column name | Description | |:-------------------------|:----------...