36 datasets found
  1. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** the link to the data artifacts is already included in the paper. 
    The link to the code will be included in the camera-ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2TB of disk space (see the Step 2 detail levels)
    - at least 16GB of RAM (64GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it is the current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  2. Data from: Data and code from: Environmental influences on drying rate of...

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • +2more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Environmental influences on drying rate of spray applied disinfestants from horticultural production services [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environmental-influences-on-drying-rate-of-spray-applied-disinfestants-
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

    Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

    Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created.

    These files are included:

    • Drying2022.csv: drying rate data for the 2022 experimental run
    • Weather2022.csv: weather data for the 2022 experimental run
    • Drying2023.csv: drying rate data for the 2023 experimental run
    • Weather2023.csv: weather data for the 2023 experimental run
    • disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
    • disinfestant_drying_analysis.html: rendered output of the notebook
    • MS_figures.R: additional R code to create figures formatted for journal requirements
    • fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This allows users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
    • fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
    • data_dictionary.xlsx: descriptions of each column in the CSV data files
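
    The fitted model objects can be reused without refitting. As a minimal sketch (assuming the .rds files load as brms model objects, as described above):

    library(brms)

    # Reload the fitted 2022 model and inspect it without refitting
    fit2022 <- readRDS("fit2022_discretetime_weather_solar.rds")
    summary(fit2022)

    # Posterior expected values for the original data, used to compare mean drying rates
    preds <- fitted(fit2022)
    head(preds)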

  3. Large Datasets in R - Plant Phenology & Temperature Data from NEON

    • qubeshub.org
    Updated May 10, 2018
    Cite
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
    Explore at:
    Dataset updated
    May 10, 2018
    Dataset provided by
    QUBES
    Authors
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
    Description

    This module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data, it has also been used in undergraduate classrooms.
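
    A minimal sketch of the kind of workflow the module teaches (the file and column names here are hypothetical, not taken from the NEON data):

    library(ggplot2)

    temps <- read.csv("temperature_daily.csv")   # hypothetical input file
    temps$date <- as.Date(temps$date)            # convert the date column from text

    ggplot(temps, aes(x = date, y = meanTemp)) +
      geom_line() +
      labs(x = "Date", y = "Mean daily temperature (C)") +
      theme_bw()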

  4. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Explore at:
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7: Display the graph in a separate window. Dot colors indicate replicates.
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
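
    The full script is provided on the PowerPoint slide; for reference, a minimal sketch consistent with the protocol above (three columns named Replicate, Condition and Value, input file chosen via a dialog box) could look like this:

    library(ggplot2)

    # Step 1 output: a .csv file with columns Replicate, Condition and Value
    data <- read.csv(file.choose(), header = TRUE)

    # Dots show individual values (coloured by replicate); boxplots are superimposed
    graph <- ggplot(data, aes(x = Condition, y = Value))
    graph + geom_boxplot(outlier.colour = 'black', colour = 'black') +
      geom_jitter(aes(col = Replicate)) +
      theme_bw()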

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  5. Data Mining Project - Boston

    • kaggle.com
    zip
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston
    Explore at:
    Available download formats: zip (59313797 bytes)
    Dataset updated
    Nov 25, 2019
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the CSV into R; here is the code for how you do this:

    This loads the file into R

    df<-read.csv('uber.csv')

    The next code subsets the data into specific car types. The example below only has Uber 'Black' car types.

    df_black<-subset(df, df$name == 'Black')

    The next portion of code writes this data frame to a CSV file on your computer so that it can be loaded into R later.

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in your working directory. If you are not familiar with your working directory, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  6. Annotated 12 lead ECG dataset

    • zenodo.org
    zip
    Updated Jun 7, 2021
    + more versions
    Cite
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2021). Annotated 12 lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3625007
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students.
    It is used as the test set in the paper:
    "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network".
    
    It contains annotations for 6 different ECG abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: HDF5 file containing a single dataset named `tracings`. This dataset is a 
    `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different 
    patients; the second dimension corresponds to the 4096 signal samples; the third dimension to the 12
    different leads of the ECG exam. 
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of 
    10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples).
    In order to make them all have the same size (4096 samples) we pad them with zeros
    on both sides. For instance, for a 7 second ECG signal with 2800 samples we include 648
    samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved
    in the hdf5 dataset. All signals are represented as floating point numbers at the scale 1e-4V: so they should
    be multiplied by 1000 in order to obtain the signals in V.
    
    In Python, one can read this file using the following sequence:
    ```python
    import h5py
    import numpy as np
    
    # "ecg_tracings.hdf5" is the HDF5 file described above
    with h5py.File("ecg_tracings.hdf5", "r") as f:
      x = np.array(f['tracings'])
    ```
    
    - The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It
    contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
    - `annotations/`: folder containing annotations in csv format. Each csv file contains 827 lines (plus the header).
    The i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5`, in all csv files.
    The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST`
    corresponding to whether the annotator has detected the abnormality in the ECG (`=1`) or not (`=0`).
     1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
     2. `gold_standard.csv` gold standard annotation for this test dataset. When cardiologist 1 and cardiologist 2
     agree, the common diagnosis was considered the gold standard. In cases where there was any disagreement, a 
     third senior specialist, aware of the annotations from the other two, decided the diagnosis. 
     3. `dnn.csv` predictions from the deep neural network described in 
     "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network". The threshold is set in such a way 
     that it maximizes the F1 score.
     4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
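    
    For example, the annotation files can be compared directly in R (a minimal sketch; file and column names are as listed above):
    
    ```r
    # Compare the DNN predictions with the gold standard for one abnormality
    gold <- read.csv("annotations/gold_standard.csv")
    dnn <- read.csv("annotations/dnn.csv")
    
    # Confusion table for sinus tachycardia (ST): rows = gold standard, columns = DNN
    table(gold$ST, dnn$ST)
    ```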
    
  7. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Dec 24, 2022
    Cite
    Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
    Explore at:
    Available download formats: bin, zip, csv
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    • The data is licensed through the Creative Commons Attribution 4.0 International.
    • If you have used our data and are publishing your work, we ask that you please reference both:
      1. this database through its DOI, and
      2. any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    • Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
    • Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
    • Unreduced_Data-#_v1-0-0.zip: contains the original (not downsampled) data
      • Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
      • We recommend that you un-zip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Clean_Data_v1-0-0.zip: contains all the downsampled data
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Database_References_v1-0-0.bib
      • Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_

    • The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
    • Time[s]: time in seconds since the start of the test
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas
    # data_file is the path to one of the per-test .csv files described above
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_

    • The first column is the index of each data point
    • S/No: sample number recorded by the DAQ
    • System Date: Date and time of sample
    • Time[s]: time in seconds since the start of the test
    • C_1_Force[kN]: load cell force
    • C_1_Déform1[mm]: extensometer displacement
    • C_1_Déplacement[mm]: cross-head displacement
    • Eng_Stress[MPa]: engineering stress
    • Eng_Strain[]: engineering strain
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    • hidden_index: internal reference ID
    • grade: material grade
    • spec: specifications for the material
    • source: base material for the test specimen
    • id: internal name for the specimen
    • lp: load protocol
    • size: type of specimen (M8, M12, M20)
    • gage_length_mm_: unreduced section length in mm
    • avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
    • avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
    • avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
    • fy_n_mpa_: nominal yield stress
    • fu_n_mpa_: nominal ultimate stress
    • t_a_deg_c_: ambient temperature in degC
    • date: date of test
    • investigator: person(s) who conducted the test
    • location: laboratory where test was conducted
    • machine: setup used to conduct test
    • pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
    • pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
    • pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
    • citekey: reference corresponding to the Database_References.bib file
    • yield_stress_mpa_: computed yield stress in MPa
    • elastic_modulus_mpa_: computed elastic modulus in MPa
    • fracture_strain: computed average true strain across the fracture surface
    • c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
    • file: file name of corresponding clean (downsampled) stress-strain data

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    import pandas as pd
    # date and version are strings that complete the file name,
    # e.g. date = '2022-08-25_' and version = 'v1-0-0' for the file included here
    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
              index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
              keep_default_na=False, na_values='')
    • citekey: reference in "Campaign_References.bib".
    • Grade: material grade.
    • Spec.: specifications (e.g., J2+N).
    • Yield Stress [MPa]: initial yield stress in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
    • Elastic Modulus [MPa]: initial elastic modulus in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    • The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:
      • A500
      • A992_Gr50
      • BCP325
      • BCR295
      • HYP400
      • S460NL
      • S690QL/25mm
      • S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
  8. Data from: A dataset to model Levantine landcover and land-use change...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Dec 16, 2023
    Cite
    Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Kempf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 16, 2023
    Area covered
    Levant
    Description

    Overview

    This dataset is the repository for the following paper submitted to Data in Brief:

    Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

    The Data in Brief article contains the supplement information and is the related data paper to:

    Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

    Description/abstract

    The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

    Folder structure

    The main folder after download contains all data; the following subfolders are stored as zipped files:

    “code” stores the nine code chunks described under “Code structure” below, used to read, extract, process, analyse, and visualize the data.

    “MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

    “mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

    “yield_productivity” contains .csv files of yield information for all countries listed above.

    “population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

    “GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

    “built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders, which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.

    Code structure

    1_MODIS_NDVI_hdf_file_extraction.R


    This is the first code chunk that refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download the MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.


    2_MERGE_MODIS_tiles.R


    In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").


    3_CROP_MODIS_merged_tiles.R


    Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
    The repository provides the already clipped and merged NDVI datasets.
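
    As a minimal sketch of this cropping step (assuming the merged files follow the NDVI_final_*.tif naming from code chunk 2 and sit in the “merged” folder):

    library(terra)
    library(gtools)

    # merged NDVI stacks produced by code chunk 2, loaded in numerical order
    files <- mixedsort(list.files("your directory_MODIS/merged",
                       pattern = "NDVI_final_.*\\.tif$", full.names = TRUE))
    mask_levant <- vect("MERGED_LEVANT.shp")

    for (i in seq_along(files)) {
      r <- rast(files[i])
      r_clip <- mask(crop(r, mask_levant), mask_levant)
      writeRaster(r_clip, paste0("NDVI_merged_clip_", i, ".tif"), overwrite = TRUE)
    }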


    4_TREND_analysis_NDVI.R


    Now, we want to perform trend analysis on the derived data. The data we load are tricky, as they contain 16-day return periods across a year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
    To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.


    5_BUILT_UP_change_raster.R


    Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different rasters to characterize the built-up change in continuous values between 1975 and 2022.


    6_POPULATION_numbers_plot.R


    For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
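
    A minimal sketch of such a plot (the column names below are assumptions for illustration, not taken from the file):

    library(ggplot2)

    pop <- read.csv("Socio_cultural_political_development_database_FAO2023.csv")

    ggplot(pop, aes(x = Year, y = Population, colour = Country)) +
      geom_line() +
      theme_bw()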


    7_YIELD_plot.R


    In this section, we are using the country productivity data from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.


    8_GLDAS_read_extract_trend


    The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9 October 2023). The raw data come in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9 October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the spatraster collection.
    Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
    From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
    From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95% confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.

  9. Countries by population 2021 (Worldometer)

    • kaggle.com
    zip
    Updated Aug 16, 2021
    Cite
    Artem Zapara (2021). Countries by population 2021 (Worldometer) [Dataset]. https://www.kaggle.com/datasets/artemzapara/countries-by-population-2021-worldometer
    Explore at:
    Available download formats: zip (8163 bytes)
    Dataset updated
    Aug 16, 2021
    Authors
    Artem Zapara
    Description

    Context

    This dataset is a clean CSV file with the most recent estimates of the population of the countries according to Worldometer. The data is taken from the following link: https://www.worldometers.info/world-population/population-by-country/

    Content

    The data has been generated by webscraping the aforementioned link on 16 August 2021. Below is the code used to make the CSV data in Python 3.8:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    url = "https://www.worldometers.info/world-population/population-by-country/"
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    countries = soup.find_all("table")[0]
    dataframe = pd.read_html(str(countries))[0]
    dataframe.to_csv("countries_by_population_2021.csv", index=False)

    Acknowledgements

    The creation of this dataset would not be possible without the team at Worldometers, a data aggregation website.

  10. Flight Data: Bengaluru to Delhi (MakeMyTrip Dataset)

    • crawlfeeds.com
    json, zip
    Updated Feb 10, 2025
    Cite
    Crawl Feeds (2025). Flight Data: Bengaluru to Delhi (MakeMyTrip Dataset) [Dataset]. https://crawlfeeds.com/datasets/flights-list-from-bengaluru-to-delhi
    Explore at:
    Available download formats: zip, json
    Dataset updated
    Feb 10, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Area covered
    Bengaluru, Delhi
    Description

    This dataset provides a detailed list of flights from Bengaluru to Delhi extracted from MakeMyTrip via Crawl Feeds. It includes essential flight information for the period between 8th December 2021 and 10th March 2022, making it ideal for analyzing travel trends, airline performance, and pricing patterns during this time frame.

    Key Features:

    • Flight Routes: All flights operating between Bengaluru (BLR) and Delhi (DEL).
    • Dates: Data collected between December 8, 2021, and March 10, 2022.
    • Airline Information: Details about different airlines, including flight numbers, departure and arrival times, and layovers (if any).
    • Flight Duration: Information on total flight times, including direct and connecting flights.
    • Price Data: Collected fare details for various classes (economy, business, premium).
    • Data Format: Structured dataset in CSV format for easy import into analysis tools like Excel, Python, or R.

    Applications:

    • Flight Trend Analysis: Analyze flight availability, pricing, and demand over the given time period.
    • Airline Comparison: Compare flight durations, fares, and services offered by various airlines.
    • Travel Demand Insights: Evaluate how air travel between Bengaluru and Delhi fluctuated during these months.
    • Price Forecasting: Study fare patterns to predict future pricing trends for flights between these cities.

    For a more extensive analysis of travel trends and to gain deeper insights into the travel industry, explore our Travel & Tourism Data offerings. Our comprehensive datasets can help you anticipate customer needs, optimize operations, and provide personalized experiences to stay ahead in the competitive travel market.

  11. RailEnV-PASMVS: a dataset for multi-view stereopsis training and...

    • zenodo.org
    • resodate.org
    • +2more
    bin, csv, png, txt +1
    Updated Jul 18, 2024
    + more versions
    Cite
    André Broekman; Petrus Johannes Gräbe (2024). RailEnV-PASMVS: a dataset for multi-view stereopsis training and reconstruction applications [Dataset]. http://doi.org/10.5281/zenodo.5233840
    Explore at:
    Available download formats: bin, csv, txt, zip, png
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    André Broekman; Petrus Johannes Gräbe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Perfectly Accurate, Synthetic dataset featuring a virtual railway EnVironment for Multi-View Stereopsis (RailEnV-PASMVS) is presented, consisting of 40 scenes and 79,800 renderings together with ground truth depth maps, extrinsic and intrinsic camera parameters and binary segmentation masks of all the track components and surrounding environment. Every scene is rendered from a set of 3 cameras, each positioned relative to the track for optimal 3D reconstruction of the rail profile. The set of cameras is translated across the 100-meter length of tangent (straight) track to yield a total of 1,995 camera views. Photorealistic lighting of each of the 40 scenes is achieved with the implementation of high-definition, high dynamic range (HDR) environmental textures. Additional variation is introduced in the form of camera focal lengths, random noise for the camera location and rotation parameters and shader modifications of the rail profile. Representative track geometry data is used to generate random and unique vertical alignment data for the rail profile for every scene. This primary, synthetic dataset is augmented by a smaller image collection consisting of 320 manually annotated photographs for improved segmentation performance. The specular rail profile represents the most challenging component for MVS reconstruction algorithms, pipelines and neural network architectures, increasing the ambiguity and complexity of the data distribution. RailEnV-PASMVS represents an application specific dataset for railway engineering, against the backdrop of existing datasets available in the field of computer vision, providing the precision required for novel research applications in the field of transportation engineering.

    File descriptions

    • RailEnV-PASMVS.blend (227 Mb) - Blender file (developed using Blender version 2.8.1) used to generate the dataset. The Blender file packs only one of the HDR environmental textures to use as an example, along with all the other asset textures.
    • RailEnV-PASMVS_sample.png (28 Mb) - A visual collage of 30 scenes, illustrating the variability introduced by using different models, illumination, material properties and camera focal lengths.
    • geometry.zip (2 Mb) - Geometry CSV files used for scenes 01 to 20. The Bezier curve defines the geometry of the rail profile (10 mm intervals).
    • PhysicalDataset.7z (2.0 Gb) - A smaller, secondary dataset of 320 manually annotated photographs of railway environments; only the railway profiles are annotated.
    • 01.7z-40.7z (2.0 Gb each) - Archive of every scene (01 through 40).
    • all_list.txt, training_list.txt, validation_list.txt - Text files containing all the scene names, together with those used for validation (validation_list.txt) and training (training_list.txt), used by MVSNet.
    • index.csv - CSV file provides a convenient reference for all the sample files, linking the corresponding file and relative data path.

    Steps to reproduce

    The open source Blender software suite (https://www.blender.org/) was used to generate the dataset, with the entire pipeline developed using the exposed Python API interface. The camera trajectory is kept fixed for all 40 scenes, except for small perturbations introduced in the form of random noise to increase the camera variation. The camera intrinsic information was initially exported as a single CSV file (scene.csv) for every scene, from which the camera information files were generated; this includes the focal length (focalLengthmm), image sensor dimensions (pixelDimensionX, pixelDimensionY), position, coordinate vector (vectC) and rotation vector (vectR). The STL model files, as provided in this data repository, were exported directly from Blender, such that the geometry/scenes can be reproduced. The data processing below is written for a Python implementation, transforming the information from Blender's coordinate system into universal rotation (R_world2cv) and translation (T_world2cv) matrices.

    import numpy as np
    from scipy.spatial.transform import Rotation as R
    
    #The intrinsic matrix K is constructed using the following formulation
    #(focalLengthmm, pixelDimensionX/Y and sensorWidthmm are read from scene.csv; dimX, dimY are the image dimensions in pixels):
    focalLengthPixel = focalLengthmm * pixelDimensionX / sensorWidthmm
    K = [[focalLengthPixel, 0, dimX/2],
       [0, focalLengthPixel, dimY/2],
       [0, 0, 1]]
    
    #The rotation vector as provided by Blender was first transformed to a rotation matrix:
    r = R.from_euler('xyz', vectR, degrees=True)
    matR = r.as_matrix()
    
    #Transpose the rotation matrix, to find matrix from the WORLD to BLENDER coordinate system:
    R_world2bcam = np.transpose(matR)
    
    #The matrix describing the transformation from BLENDER to CV/STANDARD coordinates is:
    R_bcam2cv = np.array([[1, 0, 0],
                   [0, -1, 0],
                   [0, 0, -1]])
    
    #Thus the representation from WORLD to CV/STANDARD coordinates is:
    R_world2cv = R_bcam2cv.dot(R_world2bcam)
    
    #The camera coordinate vector requires a similar transformation moving from BLENDER to WORLD coordinates:
    T_world2bcam = -1 * R_world2bcam.dot(vectC)
    T_world2cv = R_bcam2cv.dot(T_world2bcam)

    The resulting R_world2cv and T_world2cv matrices are written to the camera information file using exactly the same format as that of BlendedMVS developed by Dr. Yao. The original rotation and translation information can be found by following the process in reverse. Note that additional steps were required to convert from Blender's unique coordinate system to that of OpenCV; this ensures universal compatibility in the way that the camera intrinsic and extrinsic information is provided.

    Equivalent GPS information is provided (gps.csv), whereby the local coordinate frame is transformed into equivalent GPS information, centered around the Engineering 4.0 campus, University of Pretoria, South Africa. This information is embedded within the JPG files as EXIF data.

  12. Dataset of Elasticities of Demand for Several Agricultural Commodities

    • data.mendeley.com
    Updated Jan 15, 2020
    Cite
    Jason Holderieath (2020). Dataset of Elasticities of Demand for Several Agricultural Commodities [Dataset]. http://doi.org/10.17632/x8hjx378rt.1
    Explore at:
    Dataset updated
    Jan 15, 2020
    Authors
    Jason Holderieath
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    • aidsSystemData-DIB.xlsx contains the first combination of the USDA ERS data into a single tabular format. It also contains the formulas for the calculation of derived data.
    • aidsSystemData3.csv is the .xlsx file converted to .csv for import into R.
    • aidsvdataInbrief.R contains all code used to estimate and calculate elasticities. The following packages must be installed for the script: 'tidyverse', 'lubridate', 'micEconAids', and 'broom'.
    • Results.xlsx contains all results. Tabs are labeled as “results_” lags between price and quantity, and “h” or “m” to indicate Hicksian or Marshallian.
    • all.Rdata contains all results and intermediate objects.
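
    As a minimal sketch, the packages named above can be installed before running aidsvdataInbrief.R:

    install.packages(c('tidyverse', 'lubridate', 'micEconAids', 'broom'))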

  13. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Explore at:
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.
    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

    Getting Started

    This text provides the information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on the quantification of the meaning of research texts and to make it available for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

    1. Authors: the list of authors of the paper
    2. Title: the title of the paper
    3. Abstract: the abstract of the paper
    4. Categories: one or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of_Categories.txt’.
    5. Research Areas: one or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
    6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4]

    The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading of the Data Online

    The dataset was collected manually by exporting documents as tab-delimited files online. All documents are available online.

    Step 2: Importing the Dataset to R

    The LSC was collected as TXT files. All documents are extracted into R.

    Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category

    As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories are removed.

    Step 4: Identification and Correction of Concatenated Words in Abstracts

    Medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT. The detection and identification of such words is done by sampling medicine-related publications with human intervention. Detected concatenated words are split into two words; for instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’. The section headings in such abstracts are listed below:

    Background; Method(s); Design; Theoretical; Measurement(s); Location; Aim(s); Methodology; Process; Abstract; Population; Approach; Objective(s); Purpose(s); Subject(s); Introduction; Implication(s); Patient(s); Procedure(s); Hypothesis; Measure(s); Setting(s); Limitation(s); Discussion; Conclusion(s); Result(s); Finding(s); Material(s); Rationale(s); Implications for health and nursing policy

    Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts

    After correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as Microsoft Word ‘word count’ [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length ranges and to avoid the effect of length on the analysis.

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1

    Journals and conferences can place a footer below the text of an abstract containing a copyright notice, permission policy, journal name, licence, author rights or conference name. The tool used for extracting and processing abstracts from the WoS database attaches such footers to the text. For example, casual observation shows that copyright notices such as ‘Published by Elsevier ltd.’ appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculations, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC version 1. We removed copyright notices, conference names, journal names, author rights, licences and permission policies identified by sampling abstracts.

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts

    The cleaning procedure described in the previous step led to some abstracts falling below our minimum length criterion (30 words); 474 texts were removed.

    Step 8: Saving the Dataset into CSV Format

    Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record per line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields.

    To access the LSC for research purposes, please email ns433@le.ac.uk.

    References
    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] A. P. Association, Publication manual. American Psychological Association, Washington, DC, 1983.

  14. Data from: Solar self-sufficient households as a driving factor for...

    • data.uni-hannover.de
    .zip, r, rdata +2
    Updated Dec 12, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Institut für Kartographie und Geoinformatik (2024). Solar self-sufficient households as a driving factor for sustainability transformation [Dataset]. https://data.uni-hannover.de/dataset/solar-self-sufficient-households-as-a-driving-factor-for-sustainability-transformation
    Explore at:
    r, text/x-sh, rdata, txt, .zipAvailable download formats
    Dataset updated
    Dec 12, 2024
    Dataset authored and provided by
    Institut für Kartographie und Geoinformatik
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    To get the consumption model from Section 3.1, one needs to execute the file consumption_data.R: load the data for the 3 phases (./data/CONSUMPTION/PL1.csv, PL2.csv, PL3.csv), transform the data and build the model (starting at line 225). The final consumption data can be found in one file for each year in ./data/CONSUMPTION/MEGA_CONS_list.Rdata
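    A minimal sketch of that first step (not the original consumption_data.R; the column name used in the sum is an assumption):

    # R: load the three phase files and prepare them for the model
    library(readr)
    pl1 <- read_csv("./data/CONSUMPTION/PL1.csv")
    pl2 <- read_csv("./data/CONSUMPTION/PL2.csv")
    pl3 <- read_csv("./data/CONSUMPTION/PL3.csv")
    # hypothetical: total household consumption as the sum over the three phases;
    # the actual transformation and model fitting are done in consumption_data.R (from line 225)
    # total_consumption <- pl1$value + pl2$value + pl3$value   # 'value' is an assumed column name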

    To get the results for the optimization problem, one needs to execute the file analyze_data.R. It provides the functions to compare production and consumption data, and to optimize for the different values (PV, MBC,).

    To reproduce the figures one needs to execute the file visualize_results.R. It provides the functions to reproduce the figures.

    To calculate the solar radiation that is needed in the Production Data section, follow the file calculate_total_radiation.R.

    To reproduce the radiation data from ERA5 that can be found in data.zip, do the following steps:

    1. ERA5: download the reanalysis datasets as GRIB files. For FDIR select "Total sky direct solar radiation at surface", for GHI select "Surface solar radiation downwards", and for ALBEDO select "Forecast albedo".
    2. Convert GRIB to csv with the file era5toGRID.sh.
    3. Convert the csv file to the data that is used in this paper with the file convert_year_to_grid.R.

  15. R-LOADEST files to produce results in the Heart River Basin, North Dakota,...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 20, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2025). R-LOADEST files to produce results in the Heart River Basin, North Dakota, 1970-2020 [Dataset]. https://catalog.data.gov/dataset/r-loadest-files-to-produce-results-in-the-heart-river-basin-north-dakota-1970-2020
    Explore at:
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    North Dakota, Heart River
    Description

    This child page contains a zipped folder which contains all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p]. The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder. The allsiteinfo.table.csv file can be used to cross reference the sites with the main report (Tatge and others, 2021).

    The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations and the .csv files can be used to look at the data. The "scripts" folder contains the written R scripts to produce the results of the load estimation from the main report.

    R-LOADEST is a software package for analyzing loads in streams, and an accompanying report (Runkel and others, 2004) serves as the formal documentation for R-LOADEST. The package is a collection of functions written in R (R Development Core Team, 2019), an open-source language and a general environment for statistical computing and graphics.

    The following system requirements are necessary for producing results:
    • Windows 10 operating system
    • R (version 3.4 or later; 64-bit recommended)
    • RStudio (version 1.1.456 or later)
    • R-LOADEST program (available at https://github.com/USGS-R/rloadest)

    Runkel, R.L., Crawford, C.G., and Cohn, T.A., 2004, Load Estimator (LOADEST): A FORTRAN Program for Estimating Constituent Loads in Streams and Rivers: U.S. Geological Survey Techniques and Methods Book 4, Chapter A5, 69 p. [Also available at https://pubs.usgs.gov/tm/2005/tm4A5/pdf/508final.pdf.]
    R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
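    As an illustration of how such scripts call R-LOADEST, a minimal sketch is shown below (not one of the provided scripts; the object and column names are assumptions):

    # R: fit a load-regression model with rloadest (illustrative only)
    library(rloadest)
    # load("datain/site_NUT_rloadest.Rdata")         # loads a calibration data set (hypothetical object 'caldata')
    # fit <- loadReg(Nitrate ~ model(7),             # one of the predefined LOADEST regression models
    #                data = caldata,
    #                flow = "Flow", dates = "Date",  # hypothetical column names
    #                conc.units = "mg/L")
    # print(fit)                                     # regression coefficients and diagnostics
    # predLoad(fit, newdata = caldata)               # estimated loads for the calibration period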

  16. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional EDW such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
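    The remapping step can be pictured with a small key-value lookup (illustrative only; the real ~300-entry table ships with eLAB, and the file and column names below are assumptions):

    # R: collapse raw lab subtype names onto a single Data Dictionary code
    lab_lookup <- data.frame(
      raw_name = c("Potassium", "Potassium-External", "Potassium(POC)",
                   "Potassium,whole-bld", "Potassium,venous"),
      dd_code  = "potassium",
      stringsAsFactors = FALSE
    )
    # labs <- read.csv("bulk_lab_pull.csv", stringsAsFactors = FALSE)               # hypothetical input file
    # labs$dd_code <- lab_lookup$dd_code[match(labs$lab_name, lab_lookup$raw_name)]
    # labs <- labs[!is.na(labs$dd_code), ]   # keep only labs/units defined by the Data Dictionary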

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
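    In R, a univariable model of this form can be sketched with the survival package (variable names below are hypothetical, not the registry's field names):

    # R: univariable Cox proportional hazards model for one baseline lab predictor
    library(survival)
    # fit <- coxph(Surv(os_months, death_event) ~ baseline_potassium, data = cohort)
    # summary(fit)   # hazard ratio and exploratory (uncorrected) p-value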

  17. Digitisation of Weather Records of Seungjeongwon Ilgi: A Historical Weather...

    • zenodo.org
    bin, csv, json, txt
    Updated Sep 27, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zeyu Lyu; Zeyu Lyu; Kohei Ichikawa; Kohei Ichikawa; Yongchao Cheng; Yongchao Cheng; Hisashi Hayakawa; Hisashi Hayakawa; Yukiko Kawamoto; Yukiko Kawamoto (2023). Digitisation of Weather Records of Seungjeongwon Ilgi: A Historical Weather Dynamics Dataset of the Korean Peninsula (1623-1910) [Dataset]. http://doi.org/10.5281/zenodo.7453644
    Explore at:
    csv, json, bin, txtAvailable download formats
    Dataset updated
    Sep 27, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Zeyu Lyu; Zeyu Lyu; Kohei Ichikawa; Kohei Ichikawa; Yongchao Cheng; Yongchao Cheng; Hisashi Hayakawa; Hisashi Hayakawa; Yukiko Kawamoto; Yukiko Kawamoto
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Korea
    Description

    Introduction

    This study has exploited the daily weather records of Seungjeongwon Ilgi from the NIKH database. Seungjeongwon Ilgi (http://sjw.history.go.kr/main.do) is a daily record of the Seungjeongwon, the Royal Secretariat of the Joseon Dynasty of Korea. These diaries span 1623 to 1910 and generally include daily weather records in the entry header. The observational site was located in Seoul (N37°35′, E126°59′). We extracted the weather records from the NIKH database and classified the daily weather using text-mining methods. We also converted the report dates from the traditional lunisolar calendar to the Gregorian calendar, to better contextualise our data against contemporary daily measurements.

    Data

    We provide different formats (csv, xlsx, json) to facilitate the usage of data. The main contents of data are listed as below.

    • ID: The unique identifier of a specific record in the metadata, which can also serve as the identifier to merge with external data in the NIKH digital database.
    • Traditional calendar: The original lunar dates in the NIKH digital database, which are listed in data format "YYYY-MM-DD". More specifically, "L0" implies the leap year and "L1" implies the common year.
    • Leap: The identifier of a leap year.
    • Gregorian calendar: The Gregorian calendar date that converted by the traditional calendar date.
    • Weather Text: The text that describes the weather conditions. Specifically, multiple weather descriptions of the same day have been put together.
    • Flag: The computed value that indicates different combinations of weather conditions.
    • Volume: The volume of text in the original record.
    • Herbal Volume: The volume of text in the herbal record.
    • Sunny: A dummy variable that represents whether the weather description contains the expression of sunny.
    • Cloudy: A dummy variable that represents whether the weather description contains the expression of cloudy.
    • Rainy: A dummy variable that represents whether the weather description contains the expression of rainy.
    • Snow: A dummy variable that represents whether the weather description contains the expression of snow.
    • Wind: A dummy variable that represents whether the weather description contains the expression of wind.

    Import Data

    # Python
    # CSV file
    import pandas as pd
    data=pd.read_csv('~/SJWilgi_Seoul_Weather_YR1623_1910.csv',encoding="utf-8") 
    # JSON file
    data=pd.read_json('~/SJWilgi_Seoul_Weather_YR1623_1910.json',encoding="utf-8")
    # Excel file
    data=pd.read_excel('~/SJWilgi_Seoul_Weather_YR1623_1910.xlsx')
    # R
    # CSV file
    library(readr)
    data<- read_csv("~/SJWilgi_Seoul_Weather_YR1623_1910.csv")
    # Excel file
    library(readxl)
    data <- read_excel("~/SJWilgi_Seoul_Weather_YR1623_1910.xlsx")
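    Continuing from the R import above, the dummy variables can be aggregated into yearly counts (a sketch; the exact column names are assumptions based on the field list):

    # R: number of days per Gregorian year whose description contains a rain expression
    # data$year <- as.integer(substr(data$`Gregorian calendar`, 1, 4))   # hypothetical column name
    # rainy_days_per_year <- aggregate(Rainy ~ year, data = data, FUN = sum)
    # head(rainy_days_per_year)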

  18. Dataset for paper "Anticipating causes and consequences" currently under...

    • figshare.com
    • sussex.figshare.com
    txt
    Updated Aug 11, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alan Garnham (2020). Dataset for paper "Anticipating causes and consequences" currently under review - Experiment 2 Early Data with Differences [Dataset]. http://doi.org/10.25377/sussex.11637702.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Aug 11, 2020
    Dataset provided by
    University of Sussex
    Authors
    Alan Garnham
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    csv file for import into R as a data frame for the analysis.

    Dependent variable: dplooks - difference in proportion of looks to NP1 and NP2 pictures (averaged across 200-1100 ms from the beginning of the conjunction, "because" or "and so")

    Independent variables: Vbias - causal bias of the verb (note that consequentiality bias is to the other NP); Conj - "because" or "and so"

    Sources of random effects: Part - participants; Item - items

    The DV is numerical, the other variables are factors. The file includes other variables.
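    This structure suggests a mixed-effects model with crossed random effects for participants and items; a minimal sketch in R (not the author's analysis script; the file name is hypothetical):

    # R: mixed-effects model of the looking-preference difference score
    library(lme4)
    # d <- read.csv("experiment2_early_differences.csv")                  # hypothetical file name
    # d$Vbias <- factor(d$Vbias); d$Conj <- factor(d$Conj)
    # m <- lmer(dplooks ~ Vbias * Conj + (1 | Part) + (1 | Item), data = d)
    # summary(m)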

  19. Data from: Time Series from Smart Meters

    • observatorio-cientifico.ua.es
    • data.niaid.nih.gov
    Updated 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cruz Enrique Borges Hernandez; Carlos Quesada-Granja; Cruz Enrique Borges Hernandez; Carlos Quesada-Granja (2021). Time Series from Smart Meters [Dataset]. https://observatorio-cientifico.ua.es/documentos/668fc48cb9e7c03b01be0b7e
    Explore at:
    Dataset updated
    2021
    Authors
    Cruz Enrique Borges Hernandez; Carlos Quesada-Granja; Cruz Enrique Borges Hernandez; Carlos Quesada-Granja
    Description

    This file has technical problems that make it unsuitable for public use. Please use this dataset instead.

    Name: Time Series from Smart Meters

    Summary: The dataset contains: (1) raw and cleaned time series of smart meters from Spanish electric cooperatives, and (2) feature values and metadata extracted from residential load profiles, both from Spanish electric cooperatives and publicly available datasets.

    License: CC BY-NC-SA

    Acknowledge: These data have been collected in the framework of the WHY project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 891943.

    Disclaimer: The sole responsibility for the content of this publication lies with the authors. It does not necessarily reflect the opinion of the Executive Agency for Small and Medium-sized Enterprises (EASME) or the European Commission (EC). EASME or the EC are not responsible for any use that may be made of the information contained therein.

    Collection Date: The time series on Spanish electric cooperatives were collected between 2014 and 2021. Data on feature values and metadata were extracted between 2020 and 2021.

    Publication Date:

    DOI: 10.5281/zenodo.4455198

    Other repositories:

    Author: University of Deusto

    Objective of collection: These data were originally collected for invoicing electrical consumption. In this project they will be used to segment the households.

    Description: The dataset contains a CSV file with one entry for both each load profile of the Spanish electric cooperatives and publicly available load profiles. The fields that can be found for each entry are (1) metadata such as the originating dataset to which they belong, the start and end dates, number of days that have been imputed and/or extended, provenance (country, administrative division, municipality, zip code, origin identifier (households, businesses, industry, etc. ), classification of socioeconomic conditions, and tariffs; and (2) extracted features, grouped by types such as statistical moments, quantiles, lag d-day autocorrelations, seasonal aggregates, peak and off-peak periods, load factors, energy consumed, features obtained using the R package "tsfeatures", and Catch-22 features. In addition, the dataset also contains a rData file for each entry of the Spanish electric cooperatives. These files contain the original time series (timestamped values), a field indicating whether the signal has been imputed and/or extended and another field indicating the date of the extension.
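    A few of the listed feature types can be computed from a raw load profile along these lines (a sketch, not the WHY project's extraction code; the series below is synthetic):

    # R: example features from a synthetic hourly load profile
    set.seed(1)
    load_kwh <- pmax(rnorm(24 * 365, mean = 0.4, sd = 0.2), 0)   # one year of hourly consumption
    features <- c(
      mean        = mean(load_kwh),                              # statistical moments
      sd          = sd(load_kwh),
      q25         = unname(quantile(load_kwh, 0.25)),            # quantiles
      q75         = unname(quantile(load_kwh, 0.75)),
      acf_1day    = acf(load_kwh, lag.max = 24, plot = FALSE)$acf[25],  # lag 1-day autocorrelation
      load_factor = mean(load_kwh) / max(load_kwh),              # load factor
      energy_kwh  = sum(load_kwh)                                # energy consumed
    )
    round(features, 3)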

    5 star: ⭐⭐⭐

    Preprocessing steps: anonymization, data fusion, imputation of gaps, extension of time series shorter than 800 days, computation of features.

    Reuse: NA

    Update policy: The data will be updated throughout 2021.

    Ethics and legal aspects: Spanish electric cooperative data contains the CUPS (Meter Point Administration Number), which is personal data. A pre-processing step has been carried out to substitute the CUPS by a MD5 hash.

    Technical aspects: Decompressed data is quite large (big data).

    Other:

  20. Example code list definition in csv format.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    David A. Springate; Rosa Parisi; Ivan Olier; David Reeves; Evangelos Kontopantelis (2023). Example code list definition in csv format. [Dataset]. http://doi.org/10.1371/journal.pone.0171784.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    David A. Springate; Rosa Parisi; Ivan Olier; David Reeves; Evangelos Kontopantelis
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example code list definition in csv format.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

Related Article
Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
bin, application/gzip, zip, text/x-pythonAvailable download formats
Dataset updated
Aug 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):
- up to 2 TB of disk space
- at least 16 GB of RAM (64 GB preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up
the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15-30 minutes.

- create a folder `