This child item describes R code used to determine public supply consumptive use estimates. Consumptive use was estimated by scaling an assumed fraction of deliveries used for outdoor irrigation, derived from the estimated domestic and commercial, industrial, and institutional deliveries in the public supply delivery machine learning model child item, by spatially explicit estimates of evaporative demand. The method scales public supply water service area outdoor water use by the ratio of service-area gross reference evapotranspiration, provided by GridMET, to annual continental U.S. (CONUS) growing-season maximum evapotranspiration. Because this ratio is referenced to climate at the CONUS scale, consumptive use could be over- or under-estimated at public supply service areas where local climate variations differ from national variations. The method also assumes that 50% of total domestic and commercial, industrial, and institutional deliveries is used for outdoor purposes. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. This page includes the following file: PS_ConsumptiveUse.zip - a zip file containing input datasets, scripts, and output datasets
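The scaling described above reduces to a simple product. The following is a minimal, hypothetical R sketch of that calculation; the column names, values, and object names are illustrative assumptions, not taken from the released scripts in PS_ConsumptiveUse.zip.
#Hypothetical sketch of the consumptive use scaling described above
library(dplyr)
outdoor_fraction <- 0.5  #assumed share of total deliveries used outdoors
wsa <- data.frame(
  wsa_id               = c("WSA-001", "WSA-002"),
  total_deliveries_mgd = c(12.0, 4.5),   #domestic + commercial/industrial/institutional deliveries (assumed)
  wsa_et0_mm           = c(950, 1200),   #service-area growing-season gross reference ET (GridMET, assumed values)
  conus_max_et_mm      = 1400            #CONUS growing-season maximum ET (assumed value)
)
wsa %>%
  mutate(
    outdoor_use_mgd     = outdoor_fraction * total_deliveries_mgd,
    consumptive_use_mgd = outdoor_use_mgd * (wsa_et0_mm / conus_max_et_mm)
  )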
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Publication: Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387
Description of R code and data files in the repository: This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).
The raw input data consist of two files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates of will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).
These two input files are read by 1-script-create-input-data-raw.r, which preprocesses and combines them into a long-format data frame with the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocate with be going to) and (iv) will (frequency of the collocate with will); the result is saved as input_data_raw.txt. The script 2-script-create-motion-chart-input-data.R then processes input_data_raw.txt to normalise the co-occurrence frequencies of the collocates per million words (the COHA sizes used as the normalising base frequencies are in coha_size.txt). The output from the second script is input_data_futurate.txt.
Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart used as the image plot in the publication (script 3-script-create-motion-chart-plot.R) and (ii) the dynamic motion chart (script 4-script-motion-chart-dynamic.R).
The repository adopts the project-oriented workflow in RStudio; double-click the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
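As a rough illustration of what scripts 1 and 2 do, the preprocessing amounts to joining the two frequency tables and normalising per million words. This is a hedged sketch only: it assumes tab-separated files and hypothetical column names, and the scripts in the repository remain the authoritative versions.
#Hypothetical sketch of the preprocessing performed by scripts 1 and 2
library(dplyr)
library(readr)
will  <- read_tsv("will_INF.txt")     #assumed columns: decade, coll, freq
going <- read_tsv("go_INF.txt")       #assumed columns: decade, coll, freq
coha  <- read_tsv("coha_size.txt")    #assumed columns: decade, corpus_size
#Combine into one long-format table (one row per decade x collocate)
input_data_raw <- full_join(going, will, by = c("decade", "coll"),
                            suffix = c("_be_going_to", "_will"))
#Normalise the co-occurrence frequencies per million words
input_data_futurate <- input_data_raw %>%
  left_join(coha, by = "decade") %>%
  mutate(across(starts_with("freq"), ~ .x / corpus_size * 1e6))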
This child item describes R code used to determine water source fractions (groundwater (GW), surface water (SW), or spring (SP)) for public-supply water service areas, counties, and 12-digit hydrologic unit codes (HUC12) using information from a proprietary dataset from the U.S. Environmental Protection Agency. Water-use volumes per source were not available from public-supply systems, so water source fractions were calculated from the number of withdrawal sources of each type. For example, for a public-supply system with three SW intakes and one GW well, the fractions would be 0.75 SW and 0.25 GW. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Output from this code was used to calculate groundwater and surface water volumes by HUC12 for public supply. This page includes the following files:
FCL_Data_Water_Sources_Flagged_wHUC_DR.R - an R script used to determine water source fractions by public-supply water service areas, counties, and HUC12s
WaterSource_readme.txt - a README text file describing the script
County_SourceFrac.csv - a csv file with estimated water source fractions by county
HUC12_SourceFrac.csv - a csv file with estimated water source fractions by HUC12
WSA_AGIDF_SourceFrac.csv - a csv file with estimated water source fractions by public-supply water service area
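A minimal sketch of that counting logic, assuming hypothetical column names; the released script FCL_Data_Water_Sources_Flagged_wHUC_DR.R is the authoritative implementation.
#Hypothetical withdrawal-source inventory: one row per source per system
library(dplyr)
sources <- data.frame(
  wsa_id      = c("A", "A", "A", "A", "B", "B"),
  source_type = c("SW", "SW", "SW", "GW", "GW", "SP")
)
#Fraction of sources of each type per service area
#(system "A" works out to 0.75 SW and 0.25 GW, as in the example above)
sources %>%
  count(wsa_id, source_type) %>%
  group_by(wsa_id) %>%
  mutate(fraction = n / sum(n)) %>%
  ungroup()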
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The files include SPSS datasets and R Markdown Notebooks (codes and outputs) for the paper titled "Understanding Regional Cultural-Psychological Variations and Their Sources: A Large-Scale Examination in China". Specifically, "Data_Main.zip" contains compressed SPSS datasets (scale-level scores) for the main analyses as reported in the main text. "Data_All.zip" contains compressed SPSS datasets (item-level scores) for all respondents who passed the quality check. And "Codes_Outputs.zip" contains compressed R Markdown documents for the main analyses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open-access journal aiming to enhance data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, tv, dvd/blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written in R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy-matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of overall similarity, while the OSA (optimal string alignment) algorithm handles titles that may have typos or minor variations.
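For illustration only, the two string-distance methods named above can be compared with the stringdist package; the titles below are invented examples, not drawn from the core dataset.
library(stringdist)
core_title <- "The Hours"                                    #invented example title
candidates <- c("The Hours", "The Hour", "Hours, The", "The House")
#Cosine similarity on character q-grams: high scores for titles that share
#most of their characters, even if word order differs
stringsim(core_title, candidates, method = "cosine", q = 2)
#Optimal string alignment (OSA): tolerant of typos and single transpositions
stringsim(core_title, candidates, method = "osa")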
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does this for the first 100 films only, as a check that everything works. Scraping the entire dataset took a few hours, so a test run on a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and for tracking the progress of all of the R scripts used. It reports the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
https://www.nist.gov/open/license
Supporting datasets and algorithms (R-based) for the manuscript entitled "Development of a Tool to Determine the Variability of Consensus Mass Spectra", including an R Markdown script to reproduce the manuscript's figures.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset brings to you Iris Dataset in several data formats (see more details in the next sections).
You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter Notebook and an R Markdown report that read all these formats (a short illustrative R sketch also follows the format list below).
Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.
Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris
Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
The file downloaded is iris.data and is formatted as a comma delimited file.
This small data collection was created to help you test your skills with ingesting various data formats.
This file was processed to convert the data in the following formats:
* csv - comma separated values format
* tsv - tab separated values format
* parquet - parquet format
* feather - feather format
* parquet.gzip - compressed parquet format
* h5 - hdf5 format
* pickle - Python binary object file - pickle format
* xlsx - Excel format
* npy - Numpy (Python library) binary format
* npz - Numpy (Python library) binary compressed format
* rds - Rds (R specific data format) binary format
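A minimal R sketch (not the bundled R Markdown report) of reading a few of these formats; the file names below are assumptions about how the files are named in the collection.
library(readr)   #csv / tsv
library(arrow)   #parquet / feather
iris_csv     <- read_csv("Iris.csv")          #file names are assumed, not confirmed
iris_tsv     <- read_tsv("Iris.tsv")
iris_parquet <- read_parquet("Iris.parquet")
iris_feather <- read_feather("Iris.feather")
iris_rds     <- readRDS("Iris.rds")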
I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.
Use these data formats to test your skills in ingesting data in various formats.
AutoTrain Dataset for project: test
Dataset Description
This dataset has been automatically processed by AutoTrain for project test.
Languages
The BCP-47 code for the dataset's language is en.
Dataset Structure
Data Instances
A sample from this dataset looks as follows: [ { "feat_id": "13829542", "text": "Kasia: When are u coming back?\r Matt: Back where?\r Kasia: Oh come on\r Kasia: you know what i… See the full description on the dataset page: https://huggingface.co/datasets/j23349/autotrain-data-test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Intellectual Property Government Open Data (IPGOD) includes over 100 years of registry data on all intellectual property (IP) rights administered by IP Australia. It also has derived information about the applicants who filed these IP rights, to allow for research and analysis at the regional, business and individual level. This is the 2019 release of IPGOD.
IPGOD is large, with millions of data points across up to 40 tables, making many of the tables too large to open with Microsoft Excel. Furthermore, analysis often requires information from separate tables, which would need specialised software for merging. We recommend that advanced users interact with the IPGOD data using the right tools, with enough memory and compute power. This includes a wide range of programming and statistical software such as Tableau, Power BI, Stata, SAS, R, Python, and Scala.
IP Australia is also providing free trials of a cloud-based analytics platform, the IP Data Platform, with the capability to work with large intellectual property datasets such as IPGOD through the web browser, without any installation of software.
The following pages can help you gain an understanding of intellectual property administration and processes in Australia to support your analysis of the dataset:
* Patents
* Trade Marks
* Designs
* Plant Breeder’s Rights
Due to the changes in our systems, some tables have been affected.
* We have added IPGOD 225 and IPGOD 325 to the dataset!
* The IPGOD 206 table is not available this year.
* Many tables have been re-built, and as a result may have different columns or different possible values. Please check the data dictionary for each table before use.
Data quality has been improved across all tables.
* Null values are simply empty rather than '31/12/9999'.
* All date columns are now in ISO format 'yyyy-mm-dd'.
* All indicator columns have been converted to Boolean data type (True/False) rather than Yes/No, Y/N, or 1/0.
* All tables are encoded in UTF-8.
* All tables use the backslash \ as the escape character.
* The applicant name cleaning and matching algorithms have been updated. We believe that this year's method improves the accuracy of the matches. Please note that the "ipa_id" generated in IPGOD 2019 will not match with those in previous releases of IPGOD.
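Given the encoding and escaping conventions listed above, a hypothetical example of reading one IPGOD table in R might look like the following; the file name is illustrative only, not a confirmed IPGOD table name.
library(readr)
#Read an IPGOD table as UTF-8 with backslash as the escape character
ipgod_table <- read_delim(
  "ipgod_table.csv",                      #placeholder file name
  delim            = ",",
  escape_backslash = TRUE,
  escape_double    = FALSE,
  locale           = locale(encoding = "UTF-8")
)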
https://data.gov.cz/zdroj/datové-sady/00216208/88a8c777bba57747498f6671020dedd2/distribuce/b7e73a5c2f2635b405e5be40dd17d362/podmínky-užití
https://data.gov.cz/zdroj/datové-sady/00216208/88a8c777bba57747498f6671020dedd2/distribuce/974dc2a3d80ee61893c30d710738b059/podmínky-užití
Inspection actions of the Czech Trade Inspection Authority, their focus, and the fines issued. Transformed to RDF from the datasets https://data.gov.cz/zdroj/datové-sady/00020869/98803, https://data.gov.cz/zdroj/datové-sady/00020869/99266 and https://data.gov.cz/zdroj/datové-sady/00020869/98719
This child item describes R code used to determine whether public-supply water systems buy water, sell water, both buy and sell water, or are neutral (meaning the system has only local water supplies) using water source information from a proprietary dataset from the U.S. Environmental Protection Agency. This information was needed to better understand public-supply water use and where water buying and selling were likely to occur. Buying or selling of water may result in per capita rates that are not representative of the population within the water service area. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Output from this code was used as an input feature variable in the public supply water use machine learning model. This page includes the following files:
ID_WSA_04062022_Buyers_Sellers_DR.R - an R script used to determine whether a public-supply water service area buys water, sells water, or is neutral
BuySell_readme.txt - a README text file describing the script
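A simplified, hypothetical sketch of that classification (column names invented; ID_WSA_04062022_Buyers_Sellers_DR.R is the authoritative script):
#Hypothetical source flags per system: does the system purchase water from,
#or sell water to, other systems?
library(dplyr)
systems <- data.frame(
  wsa_id    = c("A", "B", "C", "D"),
  purchases = c(TRUE, FALSE, TRUE, FALSE),
  sells     = c(FALSE, TRUE, TRUE, FALSE)
)
systems %>%
  mutate(buy_sell = case_when(
    purchases & sells ~ "buys and sells",
    purchases         ~ "buyer",
    sells             ~ "seller",
    TRUE              ~ "neutral"    #only local water supplies
  ))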
This data package includes the underlying data and files to replicate the calculations, charts, and tables presented in Reality Check for the Global Economy, PIIE Briefing 16-3.
If you use the data, please cite as: Acalin, Julien, Olivier Blanchard, Monica de Bolle, José De Gregorio, Caroline Freund, Joseph E. Gagnon, Nicholas R. Lardy, Adam S. Posen, David J. Stockton, and Nicolas Véron. (2016). Reality Check for the Global Economy. PIIE Briefing 16-3. Peterson Institute for International Economics.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The XBT/CTD pairs dataset (Version 1) is the dataset used to calculate the historical XBT fall rate and temperature corrections presented in Cowley, R., Wijffels, S., Cheng, L., Boyer, T., and Kizu, S. (2013). Biases in Expendable Bathythermograph Data: A New View Based on Historical Side-by-Side Comparisons. Journal of Atmospheric and Oceanic Technology, 30, 1195–1225, doi:10.1175/JTECH-D-12-00127.1. http://journals.ametsoc.org/doi/abs/10.1175/JTECH-D-12-00127.1
4,115 pairs from 114 datasets were used to derive the fall rate and temperature corrections. Each dataset contains the scientifically quality-controlled version and (where available) the originator's data. The XBT/CTD pairs are identified in the document 'XBT_CTDpairs_metadata_V1.csv'. Note that future versions of the XBT/CTD pairs database may supersede this version; please check more recent versions for updates to individual datasets. Lineage: Data is sourced from the World Ocean Database, NOAA, CSIRO Marine and Atmospheric Research, Bundesamt für Seeschifffahrt und Hydrographie (BSH), Hamburg, Germany, and the Australian Antarctic Division. Original and raw data files are included where available. Quality-controlled datasets follow the procedure of Bailey, R., Gronell, A., Phillips, H., Tanner, E., and Meyers, G. (1994). Quality control cookbook for XBT data, Version 1.1. CSIRO Marine Laboratories Reports, 221. Quality-controlled data is in the 'MQNC' format used at CSIRO Marine and Atmospheric Research; the MQNC format is described in the document 'XBT_CTDpairs_descriptionV1.pdf'.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
From 11 March 2025, the dataset will be updated to include one new field, Date of Deregistration (see the help file for details).
From 7 August 2018, the Company dataset is updated weekly, every Tuesday. As a result, the information might not be accurate at the time you check the Company dataset. ASIC Connect updates information in real time, so please consider accessing information on that platform if you need up-to-date information.
***
ASIC is Australia’s corporate, markets and financial services regulator. ASIC contributes to Australia’s economic reputation and wellbeing by ensuring that Australia’s financial markets are fair and transparent, supported by confident and informed investors and consumers.
Australian companies are required to keep their details up to date on ASIC's Company Register. Information contained in the register is made available to the public to search via ASIC's website.
Select data from ASIC's Company Register will be uploaded each week to www.data.gov.au. The data made available will be a snapshot of the register at a point in time. Legislation prescribes the type of information ASIC is allowed to disclose to the public.
The information included in the downloadable dataset is:
* Company Name
* Australian Company Number (ACN)
* Type
* Class
* Sub Class
* Status
* Date of Registration
* Date of Deregistration (Available from 11 March 2025)
* Previous State of Registration (where applicable)
* State Registration Number (where applicable)
* Modified since last report – flag to indicate if data has been modified since last report
* Current Name Indicator
* Australian Business Number (ABN)
* Current Name
* Current Name Start Date
Additional information about companies can be found via ASIC's website. Accessing some information may attract a fee.
More information about searching ASIC's registers.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The dataset was derived by the Bioregional Assessment Programme from multiple datasets. The source dataset is identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement. The dataset consists of an excel spreadsheet and shapefile representing the locations of simulation nodes used in the AWRA-R model. Some of the nodes correspond to gauging station locations or dam locations, whereas other locations represent river confluences or catchment outlets which have no gauging. These are marked as "Dummy".
Purpose
Locations are used as pour points in order to define reach areas for river system modelling.
Dataset History
Subset of data for the Hunter that was extracted from the Bureau of Meteorology's Hydstra system; includes all gauges where data has been received from the lead water agency of each jurisdiction. Simulation nodes were added in locations in which the model will provide simulated streamflow. There are 3 files that have been extracted from the Hydstra database to aid in identifying sites in each bioregion and the type of data collected from each one. These data were used to determine the simulation node locations where model outputs were generated. The 3 files contained within the source dataset used for this determination are:
Site - lists all sites available in Hydstra from data providers. The data provider is listed in the #Station as _xxx. For example, sites in NSW are _77, QLD are _66. Some sites do not have locational information and will not be able to be plotted.
Period - the period table lists all the variables that are recorded at each site and the period of record.
Variable - the variable table shows variable codes and names, which can be linked to the period table.
Relevant location information and other data were extracted to construct the spreadsheet and shapefile within this dataset.
Dataset Citation
Bioregional Assessment Programme (XXXX) HUN AWRA-R simulation nodes v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/fda20928-d486-49d2-b362-e860c1918b06.
Dataset Ancestors
Derived From National Surface Water sites Hydstra
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets presented here were partially used in “Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings” (Toscano, A., Ferreira, D., Morabito, R., Computers & Chemical Engineering) [1], in “A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning” (Toscano, A., Ferreira, D., Morabito, R., Flexible Services and Manufacturing Journal) [2], and in “A heuristic approach to optimize the production scheduling of fruit-based beverages” (Toscano et al., Gestão & Produção, 2020) [3]. In fruit-based production processes, there are two production stages: preparation tanks and production lines. This production process has some process-specific characteristics, such as temporal cleanings and synchrony between the two production stages, which make optimized production planning and scheduling even more difficult. In this sense, some papers in the literature have proposed different methods to solve this problem. To the best of our knowledge, there are no standard datasets used by researchers in the literature to verify the accuracy and performance of proposed methods or to serve as a benchmark for other researchers considering this problem. The authors have been using small datasets that do not satisfactorily represent different production scenarios. Since demand in the beverage sector is seasonal, a wide range of scenarios enables us to evaluate the effectiveness of the methods proposed in the scientific literature in solving real scenarios of the problem. The datasets presented here include data based on real data collected from five beverage companies. We present four datasets that are specifically constructed assuming a scenario of restricted capacity and balanced costs. These datasets are supplementary data for the paper submitted to Data in Brief [4]. [1] Toscano, A., Ferreira, D., Morabito, R., Formulation and MIP-heuristics for the lot sizing and scheduling problem with temporal cleanings, Computers & Chemical Engineering. 142 (2020) 107038. Doi: 10.1016/j.compchemeng.2020.107038. [2] Toscano, A., Ferreira, D., Morabito, R., A decomposition heuristic to solve the two-stage lot sizing and scheduling problem with temporal cleaning, Flexible Services and Manufacturing Journal. 31 (2019) 142-173. Doi: 10.1007/s10696-017-9303-9. [3] Toscano, A., Ferreira, D., Morabito, R., Trassi, M. V. C., A heuristic approach to optimize the production scheduling of fruit-based beverages. Gestão & Produção, 27(4), e4869, 2020. https://doi.org/10.1590/0104-530X4869-20. [4] Piñeros, J., Toscano, A., Ferreira, D., Morabito, R., Datasets for lot sizing and scheduling problems in the fruit-based beverage production process. Data in Brief (2021).
Interannual differences in the water quality of Anvil Lake, WI, were examined to determine how water level and climate affect the hydrodynamics and trophic state of shallow lakes, and how important these factors are compared to anthropogenic changes in the watershed. To determine how changes in water level may affect these processes, the General Lake Model (GLM), run in R, was used to simulate how the lake’s thermal structure should change in response to changes in water level. This dataset includes the data inputs to the GLM model and the direct outputs from the model: Model Calibration (GLM_CalibrationZ); Simulation with Deep Lake and Cold Weather (GLM_Deep_Cold_SimulationZ); Simulation with Deep Lake and Hot Weather (GLM_Deep_Hot_SimulationZ); Simulation with Shallow Lake and Hot Weather (GLM_Shallow_Hot_SimulationZ); Simulation with Shallow Lake and Cold Weather (GLM_Shallow_Cold_SimulationZ).
Phase 1: ASK
1. Business Task * Cyclist is looking to increase their earnings, and wants to know if creating a social media campaign can influence "Casual" users to become "Annual" members.
2. Key Stakeholders: * The main stakeholder from Cyclist is Lily Moreno, who is the Director of Marketing and responsible for the development of campaigns and initiatives to promote their bike-share program. The other teams involved with this project will be Marketing & Analytics, and the Executive Team.
3. Business Task: * Comparing the two kinds of users and defining how they use the platform, what variables they have in common, what variables are different, and how they can get Casual users to become Annual members.
Phase 2: PREPARE:
1. Determine Data Credibility * Cyclist provided data from years 2013-2021 (through March 2021), all of which is first-hand data collected by the company.
2. Sort & Filter Data: * The stakeholders want to know how the current users are using their service, so I am focusing on using the data from 2020-2021 since this is the most relevant period of time to answer the business task.
#Installing packages
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
install.packages("readr", repos = "http://cran.us.r-project.org")
install.packages("janitor", repos = "http://cran.us.r-project.org")
install.packages("geosphere", repos = "http://cran.us.r-project.org")
install.packages("gridExtra", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(lubridate) #wday() used below is from lubridate (not attached by older tidyverse versions)
library(readr)
library(janitor)
library(geosphere)
library(gridExtra)
#Importing data & verifying the information within the dataset
all_tripdata_clean <- read.csv("/Data Projects/cyclist/cyclist_data_cleaned.csv")
glimpse(all_tripdata_clean)
summary(all_tripdata_clean)
Phase 3: PROCESS
1. Cleaning Data & Preparing for Analysis: * Once the data has been placed into one dataset and checked for errors, we begin cleaning the data. * Eliminating data that correlates to the company servicing the bikes, and any ride with a traveled distance of zero. * New columns will be added to assist in the analysis and to provide accurate assessments of who is using the bikes.
#Eliminating any data that represents the company performing maintenance (the "HQ QR" station) and trips with a negative ride length
#(Note: ride_length is calculated from started_at/ended_at in the step below; run that calculation first when working from raw data)
all_tripdata_clean <- all_tripdata_clean[!(all_tripdata_clean$start_station_name == "HQ QR" | all_tripdata_clean$ride_length < 0),]
#Creating columns for the individual date components (the date column is created first so the derived columns can use it)
all_tripdata_clean$date <- as.Date(all_tripdata_clean$started_at)
all_tripdata_clean$day <- format(as.Date(all_tripdata_clean$date), "%d")
all_tripdata_clean$month <- format(as.Date(all_tripdata_clean$date), "%m")
all_tripdata_clean$year <- format(as.Date(all_tripdata_clean$date), "%Y")
all_tripdata_clean$day_of_week <- format(as.Date(all_tripdata_clean$date), "%A")
**Now I will begin calculating the length of rides being taken, distance traveled, and the mean amount of time & distance.**
#Calculating the ride length in miles & minutes
all_tripdata_clean$ride_length <- difftime(all_tripdata_clean$ended_at,all_tripdata_clean$started_at,units = "mins")
all_tripdata_clean$ride_distance <- distGeo(matrix(c(all_tripdata_clean$start_lng, all_tripdata_clean$start_lat), ncol = 2), matrix(c(all_tripdata_clean$end_lng, all_tripdata_clean$end_lat), ncol = 2))
all_tripdata_clean$ride_distance = all_tripdata_clean$ride_distance/1609.34 #converting to miles
#Calculating the mean time and distance based on the user groups
userType_means <- all_tripdata_clean %>%
  group_by(member_casual) %>%
  summarise(mean_time = mean(ride_length), mean_distance = mean(ride_distance))
Adding in calculations that will differentiate between bike types and which type of user is using each specific bike type.
#Calculations
with_bike_type <- all_tripdata_clean %>% filter(rideable_type=="classic_bike" | rideable_type=="electric_bike")
#Totals by user type, bike type, and weekday
with_bike_type %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, rideable_type, weekday) %>%
  summarise(totals = n(), .groups = "drop")
#Totals by user type and bike type
with_bike_type %>%
  group_by(member_casual, rideable_type) %>%
  summarise(totals = n(), .groups = "drop")
#Calculating the ride differential between user types by weekday
all_tripdata_clean %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length), .groups = "drop") %>%
  arrange(member_casual, weekday)
analyze the current population survey (cps) annual social and economic supplement (asec) with r
the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.
despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.
the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.
this new github repository contains three scripts:
2005-2012 asec - download all microdata.R
* download the fixed-width file containing household, family, and person records
* import by separating this file into three tables, then merge 'em together at the person-level
* download the fixed-width file containing the person-level replicate weights
* merge the rectangular person-level file with the replicate weights, then store it in a sql database
* create a new variable - one - in the data table
2012 asec - analysis examples.R
* connect to the sql database created by the 'download all microdata' program
* create the complex sample survey object, using the replicate weights
* perform a boatload of analysis examples
replicate census estimates - 2011.R
* connect to the sql database created by the 'download all microdata' program
* create the complex sample survey object, using the replicate weights
* match the sas output shown in the png file below
2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.
click here to view these three scripts
for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
* the census bureau's current population survey page
* the bureau of labor statistics' current population survey page
* the current population survey's wikipedia article
notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.
confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
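the import pattern those scripts rely on looks roughly like this - a hedged sketch only, where the urls are placeholders and not the actual nber/census addresses wired into the download script:
library(SAScii)     #parse/read fixed-width files using sas layouts
library(DBI)
library(RSQLite)
#placeholder urls - the real nber sas script and census fixed-width file
#addresses live in the '2005-2012 asec - download all microdata.R' script
sas_layout <- "http://example.com/cpsmar2012.sas"
fwf_file   <- "http://example.com/asec2012_pubuse.zip"
#parse.SAScii() reads the sas INPUT block to recover column positions and widths
layout <- parse.SAScii(sas_layout)
#read.SAScii() uses that layout to pull the fixed-width microdata into a data.frame
asec <- read.SAScii(fwf_file, sas_layout, zipped = TRUE)
#store it in a sqlite database so later analyses don't need everything in memory
db <- dbConnect(SQLite(), "asec.sqlite")
dbWriteTable(db, "asec2012", asec)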
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
AWRA-R model implementation.
The metadata within the dataset contains the workflow, processes, input and output data and instructions to implement the Namoi AWRA-R model for model calibration or simulation. In the Namoi, the AWRA-R simulation was done twice: first without the baseflow input from the groundwater modelling, and then a second set of runs was carried out with the input from the groundwater modelling.
Each sub-folder in the associated data has a readme file indicating folder contents and providing general instructions about the workflow performed.
Detailed documentation of the AWRA-R model, is provided in: https://publications.csiro.au/rpr/download?pid=csiro:EP154523&dsid=DS2
Documentation about the implementation of AWRA-R in the Namoi bioregion is provided in BA NAM 2.6.1.3 and 2.6.1.4 products.
'..\AWRAR_Metadata_WGW\Namoi_Model_Sequence.pptx' shows the AWRA-L/R modelling sequence.
BA Surface water modelling for Namoi bioregion
The directories within contain the input data and outputs of the Namoi AWRA-R model for model calibration and simulation. The calibration folders are used as an example; simulation uses mirror files of these data, albeit with longer time series depending on the simulation period.
Detailed documentation of the AWRA-R model, is provided in: https://publications.csiro.au/rpr/download?pid=csiro:EP154523&dsid=DS2
Documentation about the implementation of AWRA-R in the Namoi bioregion is provided in the BA NAM 2.6.1.3 and 2.6.1.4 products.
Additional data needed to generate some of the inputs needed to implement AWRA-R are detailed in the corresponding metadata statement as stated below.
Here is the parent folder:
'..\AWRAR_Metadata_WGW..'
Input data needed:
Gauge/node topological information in '...\model calibration\NAM5.3.1_low_calib\gis\sites\AWRARv5.00_reaches.csv'.
Look up table for soil thickness in '...\model calibration\NAM5.3.1_low_calib\gis\ASRIS_soil_properties\NAM_AWRAR_ASRIS_soil_thickness_v5.00.csv'. (check metadata statement)
Look up tables of AWRA-LG groundwater parameters in '...\model calibration\NAM5.3.1_low_calib\gis\AWRA-LG_gw_parameters\'.
Look up table of AWRA-LG catchment grid cell contribution in '...model calibration\NAM5.3.1_low_calib\gis\catchment-boundary\AWRA-R_catchment_x_AWRA-L_weight.csv'. (check metadata statement)
Look up tables of link lengths for main river, tributaries and distributaries within a reach in \model calibration\NAM5.3.1_low_calib\gis\rivers\'. (check metadata statement)
Time series data of AWRA-LG outputs: evaporation, rainfall, runoff and depth to groundwater.
Gridded data of AWRA-LG groundwater parameters, refer to explanation in '...'\model calibration\NAM5.3.1_low_calib\rawdata\AWRA_LG_output\gw_parameters\README.txt'.
Time series of observed or simulated reservoir level, volume and surface area for reservoirs used in the simulation: Keepit Dam, Split Rock and Chaffey Creek Dam.
located in '...\model calibration\NAM5.3.1_low_calib\rawdata\reservoirs\'.
Gauge station cross sections in '...\model calibration\NAM5.3.1_low_calib\rawdata\Site_Station_Sections\'. (check metadata statement)
Daily Streamflow and level time-series in'...\model calibration\NAM5.3.1_low_calib\rawdata\streamflow_and_level_all_processed\'.
Irrigation input, configuration and parameter files in '...\model calibration\NAM5.3.1_low_calib\inputs\NAM\irrigation\'.
These come from the separate calibration of the AWRA-R irrigation module in:
'...\irrigation calibration simulation\', refer to explanation in readme.txt file therein.
' ..\AWRAR_Metadata_WGW\dam model calibration simulation\Chaffey\readme.txt'
'... \AWRAR_Metadata_WGW\dam model calibration simulation\Split_Rock_and_Keepit\readme.txt'
Relevant outputs include:
AWRA-R time series of stores and fluxes in river reaches ('...\AWRAR_Metadata_WGW\model calibration\NAM5.3.1_low_calib\outputs\jointcalibration\v00\NAM\simulations\')
including simulated streamflow in files denoted XXXXXX_full_period_states_nonrouting.csv where XXXXXX denotes gauge or node ID.
AWRA-R time series of stores and fluxes for irrigation/mining in the same directory as above in files XXXXXX_irrigation_states.csv
AWRA-R calibration validation goodness of fit metrics ('...\AWRAR_Metadata_WGW\model calibration\NAM5.3.1_low_calib\outputs\jointcalibration\v00\NAM\postprocessing\')
in files calval_results_XXXXXX_v5.00.csv
Bioregional Assessment Programme (2017) Namoi AWRA-R model implementation (post groundwater input). Bioregional Assessment Derived Dataset. Viewed 12 March 2019, http://data.bioregionalassessments.gov.au/dataset/8681bd56-1806-40a8-892e-4da13cda86b8.
Derived From Historical Mining Footprints DTIRIS NAM 20150914
Derived From GEODATA 9 second DEM and D8: Digital Elevation Model Version 3 and Flow Direction Grid 2008
Derived From Namoi Environmental Impact Statements - Mine footprints
Derived From Namoi Surface Water Mine Footprints - digitised
Derived From River Styles Spatial Layer for New South Wales
Derived From National Surface Water sites Hydstra
Derived From Namoi AWRA-L model
Derived From Namoi Hydstra surface water time series v1 extracted 140814
Derived From Namoi AWRA-R (restricted input data implementation)
Derived From Namoi Existing Mine Development Surface Water Footprints