7 datasets found
  1. The morphologically glossed Rigveda - The Zurich annotation corpus revised and extended

    • zenodo.org
    bin
    Updated May 22, 2025
    Cite
    Antje Casaretto; Pascal Coenen; Anna Fischer; Jakob Halfmann; Natalie Korobzow; Daniel Kölligan; Uta Reinöhl (2025). The morphologically glossed Rigveda - The Zurich annotation corpus revised and extended. Hosted by VedaWeb - Online Research Platform for Old Indic Texts. [Dataset]. http://doi.org/10.5281/zenodo.15489124
    Available download formats: bin
    Dataset updated
    May 22, 2025
    Dataset provided by
    Zenodo
    Authors
    Antje Casaretto; Pascal Coenen; Anna Fischer; Jakob Halfmann; Natalie Korobzow; Daniel Kölligan; Uta Reinöhl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Zürich
    Description

    This file contains morphological and lexicographic annotations for the Rigveda. It was created in the DFG-funded research project Vedaweb and used as source data for the linguistic research platform vedaweb.uni-koeln.de.

    Prof. Dr. Paul Widmer and Dr. Salvatore Scarlata from the "Institut für Vergleichende Sprachwissenschaft" (Universität Zürich) provided the VedaWeb project with a FileMaker file that was later transformed in Cologne into an Excel file. This data contained a version of the Rigveda by Prof. Dr. A. Lubotsky ("Indo-European Linguistics", Leiden University) that had been morphosyntactically annotated over the course of more than 10 years at the University of Zurich. For each token it also contained, where available, a reference to an entry in Grassmann's dictionary for the Rigveda.

    Modifications made by Jakob Halfmann and Natalie Korobzow to the data in 2020:

    Disambiguation of the relevant categories, if unspecified in Zurich data, according to the Grassmann dictionary (updates from 6th edition partially included up to page 274):

    • case, gender and number for nouns, pronouns (columns G–I)
    • number, person, mood, tense and voice for verbs (columns I–M) up to line 109216
    • case, gender, number, tense and voice for participles (columns G–I, L–M) up to line 109216
    • absolutives are marked as Abs. in columns N and V
    • Inconsistencies between the original file from Zurich and the Grassmann dictionary as well as internal inconsistencies in Grassmann are noted in column AE, whenever they were noticed.
    • Zurich data was overwritten by conflicting Grassmann data in columns G–M but retained elsewhere.
    • Verb classes according to Whitney (1885) and Jamison (1983) for class 10 in column Y, differences in root spelling between Whitney and Grassmann are noted in column Z. All potential verb classes provided by Whitney are given for every occurrence of the root.
    • Local particles and verbal forms containing them are marked as LP in column AF.
    • Comparatives and superlatives are marked as such in column X and desideratives as Des. in column Y.

    Modifications made by Anna Fischer (data transformation, technical realisation) to the data:

    New structure of data table for linguistic annotations with new column titles:

    • A - "VERS_NR": renamed column (from "belege::stelleMMSSSRR")
    • B - "PADA_NR": renamed column (from "belege::pada")
    • C - "PADA_TEXT_LUBOTSKY": renamed column (from "belege::lubotskypada")
    • D - "TOKEN_NR_VERS": renamed column (from "belege::wortnummer rc")
    • E - "TOKEN_NR_PADA": renamed column (from "belege::wortnummer pada")
    • F - "FORM": renamed column (from "belege::form")
    • G - "KASUS": renamed column (from "belege::kasus")
    • H - "GENUS": renamed column (from "belege::genus")
    • I - "NUMERUS": renamed column (from "belege::numerus")
    • J - "PERSON": renamed column (from "belege::person")
    • K - "TEMPUS": moved and renamed column (from L "belege::tempus")
    • L - "PRAESENSKLASSE": created new column for present stem class for each form
    • M - "LEMMA_PRAESENSKLASSEN": created column for present stem classes of the respective lemma; moved and renamed column (from Y "formen::zusätzliche merkmale verb"); moved values "Abs." and "Inf." to column P "INFINIT"; moved values "Prek." and "si-Ipv." to column N "MODUS"; moved value "Des." to column Q "ABGELEITETE_KONJUGATION"; moved value "se-Form" to column W "WEITERE_WERTE"
    • N - "MODUS": moved and renamed column (from K "belege::modus")
    • O - "DIATHESE": moved and renamed column (from M "belege::diathese")
    • P - "INFINIT": created new column for infinite forms "Abs.", "Inf.", "Ptz.", "ta-Ptz.", "na-Ptz."
    • Q - "ABGELEITETE_KONJUGATION": created new column for secondary conjugation "Des.", "Int.", "Kaus."
    • R - "GRADUS": created new column for degree: "Comp.", "Sup."
    • S - "LOKALPARTIKEL": moved and renamed column (from AF "LP")
    • T - "LEMMA_ZÜRICH": moved and renamed column (from AA "lemmata klassisch::lemma")
    • U - "LEMMA_ZÜRICH_LEMMATYP": moved and renamed column (from AB "lemmata klassisch::lemmatyp")
    • V - "LEMMA_ZÜRICH_BEDEUTUNG": moved and renamed column (from AC "lemmata klassisch::bedeutung")
    • W - "WEITERE_WERTE": created new column for all miscellaneous values: e.g. "Hyperchar.", "n-haltig", "se-Form"
    • X - "KOMMENTAR": created new column merging former columns Z "formen::HELPformbestimmung", AD "lemmata klassisch::HELPbedeutung" and AE "anmerkungen abweichungen"
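As a rough illustration of the restructuring above, the old-to-new column-title scheme amounts to a lookup applied to the header row. This is a hypothetical Python sketch, not the project's actual tooling, and it includes only a few of the renames listed above:

```python
# Hypothetical sketch: apply part of the old-to-new column-title
# mapping described above to a header row. Only a subset of the
# renames listed above is included here.
RENAMES = {
    "belege::stelleMMSSSRR": "VERS_NR",
    "belege::pada": "PADA_NR",
    "belege::lubotskypada": "PADA_TEXT_LUBOTSKY",
    "belege::form": "FORM",
    "belege::kasus": "KASUS",
}

def rename_header(header):
    """Return the header with every known old title replaced;
    unknown titles are passed through unchanged."""
    return [RENAMES.get(col, col) for col in header]

old_header = ["belege::stelleMMSSSRR", "belege::pada", "belege::form"]
print(rename_header(old_header))  # ['VERS_NR', 'PADA_NR', 'FORM']
```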

    Columns that were removed due to redundant information:

    • "formen::zusätzliche merkmale nomen": values "superlative" and "comparative" were renamed "Sup." and "Comp." and moved to new column R "GRADUS"; all other values were moved to the new miscellaneous column W "WEITERE_WERTE"
    • "belege::belegbestimmung summe simpel": values "Ptz.", "ta-Ptz." and "na-Ptz." were moved to new column P "INFINIT"
    • "belege::kasus bestof"
    • "belege::genus bestof"
    • "belege::numerus bestof"
    • "belege::person bestof"
    • "belege::modus bestof"
    • "belege::tempus bestof"
    • "belege::diathese bestof"
    • "belege::belegbestimmung bestof summe sophistiziert"

    Revisions and additions made by Antje Casaretto to the data in 2023:

    • F-T: revision and correction (wherever necessary) of all annotations (books 1-7)
    • G,H,I - disambiguation of case forms regarding pronouns and nominal forms, if unspecified in Zurich data (books 1-7)
    • L - disambiguation of present stem classes (book 7 and book 1 up to line 21050 vers 01.125.01)
    • M - disambiguation of denominal verbs from primary verbs of the 10th class (books 1-10)
    • N - disambiguation of precative and optative forms wherever possible (books 1-7)
    • Q - new annotations for "Int." (intensives) and "Kaus." (causatives) (books 1-7)

    Revisions and additions made by Antje Casaretto to the data in 2024 with support in data modeling and automation by Anna Fischer:

    • F-T: revision and correction (wherever necessary) of all annotations (books 8-10)
    • G,H,I: disambiguation of case and gender forms in nominal and pronominal forms, if unspecified in Zurich data (books 8-10)
    • L, M: disambiguation of present stem classes (books 1-10)
    • N: disambiguation of precative and optative forms wherever possible (books 8-10)
    • P: new annotations for "Gdv." (gerundives)
    • Q: new annotations for "Den." (denominatives) (books 1-10) and further annotations of “Kaus.” (causatives) and “Int.” (intensives) (books 8-10)
    • T: revision of lemmatization (books 1-10)
    • V: update of meanings according to revised lemmatization; minimal revision
    • W: revised annotation of ending -se (“se-Form”) (books 1-10); no systematic revision
    • X: no systematic revision
    • A-U: general revision of formal inconsistencies and typing errors (book 1-10)

    Revisions made by Natalie Korobzow and Pascal Coenen to the data in 2024 with computational support by Anna Fischer:

    • Y - "LEMMA_GRASSMANN_ID": new column for references to Grassmann dictionary (books 1-10) and revision of Grassmann references
  2. R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart

    • bridges.monash.edu
    • researchdata.edu.au
    zip
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication

    Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.), Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consists of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to respectively across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for frequency of the collocates with be going to) and (iv) will (for frequency of the collocates with will); it is available in input_data_raw.txt. Then, the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from this second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart used as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
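The combine-and-normalise step described above can be sketched roughly as follows. This is a stdlib-Python illustration of the idea only, not the repository's actual R code; the frequency counts and decade sizes below are made-up placeholders, not COHA data:

```python
# Sketch of the preprocessing idea: combine per-decade collocate
# frequencies for "will" and "be going to" into long-format rows,
# then normalise to frequency per million words.
# All numbers here are invented placeholders, not COHA counts.
will_freq = {("1810s", "go"): 120, ("1820s", "go"): 150}
goingto_freq = {("1810s", "go"): 5, ("1820s", "go"): 9}
decade_size = {"1810s": 1_200_000, "1820s": 6_900_000}  # placeholder corpus sizes

def to_long(will, goingto, sizes):
    """One row per (decade, collocate), frequencies per million words."""
    rows = []
    for (decade, coll) in sorted(set(will) | set(goingto)):
        per_million = 1_000_000 / sizes[decade]
        rows.append({
            "decade": decade,
            "coll": coll,
            "BE going to": goingto.get((decade, coll), 0) * per_million,
            "will": will.get((decade, coll), 0) * per_million,
        })
    return rows

rows = to_long(will_freq, goingto_freq, decade_size)
print(rows[0]["will"])  # 120 raw hits in 1.2M words -> 100.0 per million
```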

  3. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.
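The expected shape of the Step 1 input can be illustrated as follows. This is a plain-Python sketch of the three-column .csv (the protocol itself uses R, and the sample values are hypothetical):

```python
import csv
import io

# Hypothetical example of the three-column .csv produced in Step 1:
# Replicate (biological replicate), Condition, Value (continuous).
csv_text = """Replicate,Condition,Value
2016-01,WT,4.2
2016-01,mutant A,2.9
2016-02,WT,4.5
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
# Every row carries exactly the three expected columns.
assert set(rows[0]) == {"Replicate", "Condition", "Value"}
values = [float(r["Value"]) for r in rows]
print(len(rows), min(values), max(values))  # 3 2.9 4.5
```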

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates

    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  4. Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016

    • search.datacite.org
    • openicpsr.org
    Updated 2018
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
    Dataset updated
    2018
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    DataCite (https://www.datacite.org/)
    Authors
    Jacob Kaplan
    Description

    Version 5 release notes:
    • Removes support for SPSS and Excel data.
    • Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
    • Adds in agencies that report 0 months of the year.
    • Adds a column that indicates the number of months reported. This is generated by summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; they may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
    • Removes data on runaways.
    Version 4 release notes:
    • Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
    Version 3 release notes:
    • Adds data for 2016.
    • Orders rows by year (descending) and ORI.
    Version 2 release notes:
    • Fixes bug where Philadelphia Police Department had an incorrect FIPS county code.
    The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

    I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported"; in other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
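The two recoding rules just described (recode "None/not reported" to zero; recode the implausible round values to missing) can be sketched like this. This is a Python illustration of the rules as stated, not the author's actual cleaning code, which was written in R:

```python
# Sketch of the two recoding rules described above.
BAD_VALUES = {10000, 20000, 30000, 40000, 50000, 60000,
              70000, 80000, 90000, 100000, 99999, 99998}

def clean_arrest_value(value):
    """'None/not reported' -> 0; implausible round counts -> None (NA)."""
    if value == "None/not reported":
        return 0
    value = int(value)
    return None if value in BAD_VALUES else value

print(clean_arrest_value("None/not reported"))  # 0
print(clean_arrest_value("99999"))              # None
print(clean_arrest_value("12"))                 # 12
```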

    To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.

    I created 9 arrest categories myself. The categories are:
    Total Male Juvenile; Total Female Juvenile; Total Male Adult; Total Female Adult; Total Male; Total Female; Total Juvenile; Total Adult; Total Arrests.
    All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than using the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.
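The derived totals are plain sums over the sex-age columns. The following sketch uses invented column names and counts purely for illustration; they are not the dataset's actual column labels:

```python
# Invented example row: arrests broken down by sex-age category.
# Column names here are hypothetical, not the dataset's real labels.
row = {
    "male_under_10": 0, "male_15": 3, "male_18": 7, "male_25": 12,
    "female_under_10": 0, "female_15": 1, "female_22": 4,
}

# Derived category totals as described above: sum the sex-age columns.
total_male = sum(v for k, v in row.items() if k.startswith("male_"))
total_female = sum(v for k, v in row.items() if k.startswith("female_"))
total_arrests = total_male + total_female

print(total_male, total_female, total_arrests)  # 22 5 27
```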

    As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files, eight which contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each have crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to 32 characters maximum, I have abbreviated the crime names in the data. The files and their included crimes are:

    Index Crimes: Murder, Rape, Robbery, Aggravated Assault, Burglary, Theft, Motor Vehicle Theft, Arson
    Alcohol Crimes: DUI, Drunkenness, Liquor
    Drug Crimes: Total Drug, Total Drug Sales, Total Drug Possession, Cannabis Possession, Cannabis Sales, Heroin or Cocaine Possession, Heroin or Cocaine Sales, Other Drug Possession, Other Drug Sales, Synthetic Narcotic Possession, Synthetic Narcotic Sales
    Grey Collar and Property Crimes: Forgery, Fraud, Stolen Property, Financial Crimes, Embezzlement, Total Gambling, Other Gambling, Bookmaking, Numbers Lottery
    Sex or Family Crimes: Offenses Against the Family and Children, Other Sex Offenses, Prostitution, Rape
    Violent Crimes: Aggravated Assault, Murder, Negligent Manslaughter, Robbery, Weapon Offenses
    Other Crimes: Curfew, Disorderly Conduct, Other Non-traffic, Suspicion, Vandalism, Vagrancy
    Simple: This data set has every crime and only the arrest categories that I created (see above).
    If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

  5. Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation

    • dataverse.harvard.edu
    • dataone.org
    Updated Jul 6, 2017
    Cite
    Harvard Dataverse (2017). Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation [Dataset]. http://doi.org/10.7910/DVN/6C3JR1
    Available download formats: application/x-rlang-transport (24678017)
    Dataset updated
    Jul 6, 2017
    Dataset provided by
    Harvard Dataverse
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6C3JR1

    Description

    User Agreement, Public Domain Dedication, and Disclaimer of Liability

    By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

    The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

    In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.

    Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.

    When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.

    This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

    Description

    This dataverse contains the data referenced in Rieth et al. (2017), Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems, to be presented at Applied Human Factors and Ergonomics 2017.

    Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named 'fault_free_training', 'fault_free_testing', 'faulty_testing', and 'faulty_training', corresponding to the RData files.

    Each dataframe contains 55 columns:

    Column 1 ('faultNumber') ranges from 1 to 20 in the "Faulty" datasets and represents the fault type in the TEP. The "FaultFree" datasets only contain fault 0 (i.e. normal operating conditions).

    Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (note: the actual seeds used to generate training and testing datasets were non-overlapping).

    Column 3 ('sample') ranges either from 1 to 500 ("Training" datasets) or 1 to 960 ("Testing" datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

    Columns 4 to 55 contain the process variables; the column names retain the original variable names.

    Acknowledgments

    This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

  6. Tennessee Eastman Process Simulation Dataset

    • kaggle.com
    zip
    Updated Feb 9, 2020
    Cite
    Sergei Averkiev (2020). Tennessee Eastman Process Simulation Dataset [Dataset]. https://www.kaggle.com/averkij/tennessee-eastman-process-simulation-dataset
    Available download formats: zip (1370814903 bytes)
    Dataset updated
    Feb 9, 2020
    Authors
    Sergei Averkiev
    Description

    Intro

    This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.

    Content

    Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.

    Each dataframe contains 55 columns:

    Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).

    Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).

    Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

    Columns 4 to 55 contain the process variables; the column names retain the original variable names.
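The stated sample counts follow directly from the 3-minute sampling interval; a quick arithmetic check (not code from the dataset):

```python
# 3-minute sampling interval: 20 samples per hour.
samples_per_hour = 60 // 3

training_samples = 25 * samples_per_hour  # 25-hour training runs
testing_samples = 48 * samples_per_hour   # 48-hour testing runs

print(training_samples, testing_samples)  # 500 960
```

These match the 'sample' column ranges given above (1 to 500 for Training, 1 to 960 for Testing).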

    Acknowledgements

    This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.

    User Agreement

    By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.

    The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.

    In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.

    Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.

    When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.

    This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

  7. Mental Disorder symptoms datasets

    • kaggle.com
    Updated Aug 16, 2020
    Cite
    Rohit Zaman (2020). Mental Disorder symptoms datasets [Dataset]. https://www.kaggle.com/rohitzaman/mental-health-symptoms-datasets/code
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 16, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rohit Zaman
    Description

    Background of creating the Datasets

    As we know, many people around the world have different mental disorders, and there are fixed sets of symptoms for detecting the various disorders. For predicting any disorder it was very important to have datasets pairing symptoms with the disorder name, so I collected symptoms and disorder names from various websites. I used the R programming language to create the datasets.

    Description of the Datasets

    The dataset has 25 columns: 24 columns are boolean and 1 column is a string.

  8. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Antje Casaretto; Pascal Coenen; Pascal Coenen; Anna Fischer; Anna Fischer; Jakob Halfmann; Jakob Halfmann; Natalie Korobzow; Natalie Korobzow; Daniel Kölligan; Daniel Kölligan; Uta Reinöhl; Uta Reinöhl; Antje Casaretto (2025). The morphologically glossed Rigveda - The Zurich annotation corpus revised and extended. Hosted by VedaWeb - Online Research Platform for Old Indic Texts. [Dataset]. http://doi.org/10.5281/zenodo.15489124

The morphologically glossed Rigveda - The Zurich annotation corpus revised and extended. Hosted by VedaWeb - Online Research Platform for Old Indic Texts.

Explore at:
Available download formats: bin
Dataset updated
May 22, 2025
Dataset provided by
Zenodo
Authors
Antje Casaretto; Pascal Coenen; Anna Fischer; Jakob Halfmann; Natalie Korobzow; Daniel Kölligan; Uta Reinöhl
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
Zürich
Description

This file contains morphological and lexicographic annotations for the Rigveda. It was created in the DFG-funded research project VedaWeb and used as source data for the linguistic research platform vedaweb.uni-koeln.de.

Prof. Dr. Paul Widmer and Dr. Salvatore Scarlata from the "Institut für Vergleichende Sprachwissenschaft" (Universität Zürich) provided the VedaWeb project with a FileMaker file that was later transformed into an Excel file in Cologne. This data contained a version of the Rigveda by Prof. Dr. A. Lubotsky ("Indo-European Linguistics", Leiden University) that had been morphosyntactically annotated over the course of more than 10 years at the University of Zurich. It also contained, for each token where available, a reference to an entry in Grassmann's dictionary for the Rigveda.

Modifications made by Jakob Halfmann and Natalie Korobzow to the data in 2020:

Disambiguation of the relevant categories, if unspecified in Zurich data, according to the Grassmann dictionary (updates from 6th edition partially included up to page 274):

  • case, gender and number for nouns, pronouns (columns G–I)
  • number, person, mood, tense and voice for verbs (columns I–M) up to line 109216
  • case, gender, number, tense and voice for participles (columns G–I, L–M) up to line 109216
  • absolutives are marked as Abs. in columns N and V
  • Inconsistencies between the original file from Zurich and the Grassmann dictionary as well as internal inconsistencies in Grassmann are noted in column AE, whenever they were noticed.
  • Zurich data was overwritten by conflicting Grassmann data in columns G–M but retained elsewhere.
  • Verb classes according to Whitney (1885), and according to Jamison (1983) for class 10, are given in column Y; differences in root spelling between Whitney and Grassmann are noted in column Z. All potential verb classes provided by Whitney are given for every occurrence of the root.
  • Local particles and verbal forms containing them are marked as LP in column AF.
  • Comparatives and superlatives are marked as such in column X and desideratives as Des. in column Y.
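The overwrite rule described above (Grassmann values fill in or take precedence in the grammatical columns G–M, Zurich values are retained elsewhere, and disagreements are logged, cf. column AE) can be sketched roughly as follows. The column letters come from the list above, but the helper function, row data, and note format are invented for illustration:

```python
# Sketch of the disambiguation rule: for the grammatical columns G-M,
# a Grassmann value fills an unspecified Zurich cell and overwrites a
# conflicting one; conflicts are logged (cf. column AE). Hypothetical data.
GRAMMATICAL_COLS = ["G", "H", "I", "J", "K", "L", "M"]

def merge_annotations(zurich_row, grassmann_row, notes):
    merged = dict(zurich_row)
    for col in GRAMMATICAL_COLS:
        z, g = zurich_row.get(col), grassmann_row.get(col)
        if g is None:
            continue  # nothing to disambiguate with
        if z is None:
            merged[col] = g  # fill in unspecified Zurich cell
        elif z != g:
            notes.append(f"{col}: Zurich '{z}' vs. Grassmann '{g}'")
            merged[col] = g  # Grassmann overwrites the conflicting value
    return merged

notes = []
row = merge_annotations({"G": None, "H": "m.", "T": "agni-"},
                        {"G": "Nom.", "H": "n."}, notes)
# "G" is filled from Grassmann, "H" is overwritten and logged,
# "T" (outside G-M) is retained from the Zurich data.
```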

Modifications made by Anna Fischer (data transformation, technical realisation) to the data:

New structure of data table for linguistic annotations with new column titles:

  • A - "VERS_NR": renamed column (from "belege::stelleMMSSSRR")
  • B - "PADA_NR": renamed column (from "belege::pada")
  • C - "PADA_TEXT_LUBOTSKY": renamed column (from "belege::lubotskypada")
  • D - "TOKEN_NR_VERS": renamed column (from "belege::wortnummer rc")
  • E - "TOKEN_NR_PADA": renamed column (from "belege::wortnummer pada")
  • F - "FORM": renamed column (from "belege::form")
  • G - "KASUS": renamed column (from "belege::kasus")
  • H - "GENUS": renamed column (from "belege::genus")
  • I - "NUMERUS": renamed column (from "belege::numerus")
  • J - "PERSON": renamed column (from "belege::person")
  • K - "TEMPUS": moved and renamed column (from L "belege::tempus")
  • L - "PRAESENSKLASSE": created new column for present stem class for each form
  • M - "LEMMA_PRAESENSKLASSEN": created column for the present stem classes of the respective lemma; moved and renamed column (from Y "formen::zusätzliche merkmale verb"); moved values "Abs." and "Inf." to column P "INFINIT"; moved values "Prek." and "si-Ipv." to column N "MODUS"; moved value "Des." to column Q "ABGELEITETE_KONJUGATION"; moved value "se-Form" to column W "WEITERE_WERTE"
  • N - "MODUS": moved and renamed column (from K "belege::modus")
  • O - "DIATHESE": moved and renamed column (from M "belege::diathese")
  • P - "INFINIT": created new column for infinite forms "Abs.", "Inf.", "Ptz.", "ta-Ptz.", "na-Ptz."
  • Q - "ABGELEITETE_KONJUGATION": created new column for secondary conjugation "Des.", "Int.", "Kaus."
  • R - "GRADUS": created new column for degree: "Comp.", "Sup."
  • S - "LOKALPARTIKEL": moved and renamed column (from AF "LP")
  • T - "LEMMA_ZÜRICH": moved and renamed column (from AA "lemmata klassisch::lemma")
  • U - "LEMMA_ZÜRICH_LEMMATYP": moved and renamed column (from AB "lemmata klassisch::lemmatyp")
  • V - "LEMMA_ZÜRICH_BEDEUTUNG": moved and renamed column (from AC "lemmata klassisch::bedeutung")
  • W - "WEITERE_WERTE": created new column for all miscellaneous values: e.g. "Hyperchar.", "n-haltig", "se-Form"
  • X - "KOMMENTAR": created new column merging former columns Z "formen::HELPformbestimmung", AD "lemmata klassisch::HELPbedeutung" and AE "anmerkungen abweichungen"
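The renaming scheme above amounts to a simple old-title → new-title mapping. The actual transformation pipeline is not published, so the following pandas sketch only illustrates the idea, using an excerpt of the mapping from the list:

```python
import pandas as pd

# Excerpt of the old -> new column-title mapping described above.
RENAME_MAP = {
    "belege::stelleMMSSSRR": "VERS_NR",
    "belege::pada": "PADA_NR",
    "belege::lubotskypada": "PADA_TEXT_LUBOTSKY",
    "belege::form": "FORM",
    "belege::kasus": "KASUS",
}

# Empty table with the old Zurich/FileMaker column titles.
df = pd.DataFrame(columns=list(RENAME_MAP))
df = df.rename(columns=RENAME_MAP)
assert list(df.columns) == ["VERS_NR", "PADA_NR",
                            "PADA_TEXT_LUBOTSKY", "FORM", "KASUS"]
```

Keeping the mapping in one dictionary makes the restructuring reproducible and easy to audit against the list above.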

Columns that were removed due to redundant information:

  • "formen::zusätzliche merkmale nomen": values "superlative" and "comparative" were renamed "Sup." and "Comp." and moved to the new column R "GRADUS"; all other values were moved to the new miscellaneous column W "WEITERE_WERTE"
  • "belege::belegbestimmung summe simpel": values "Ptz.", "ta-Ptz." and "na-Ptz." were moved to the new column P "INFINIT"
  • "belege::kasus bestof"
  • "belege::genus bestof"
  • "belege::numerus bestof"
  • "belege::person bestof"
  • "belege::modus bestof"
  • "belege::tempus bestof"
  • "belege::diathese bestof"
  • "belege::belegbestimmung bestof summe sophistiziert"
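Once any values worth keeping have been moved (as described for the first two columns), dropping the redundant "bestof" columns is a single operation. A pandas sketch with invented two-row data:

```python
import pandas as pd

# Invented example: the "bestof" columns duplicate information already
# present in the renamed annotation columns and are dropped.
df = pd.DataFrame({
    "KASUS": ["Nom.", "Akk."],
    "belege::kasus bestof": ["Nom.", "Akk."],
    "belege::genus bestof": ["m.", "f."],
})
REDUNDANT = ["belege::kasus bestof", "belege::genus bestof"]
df = df.drop(columns=REDUNDANT)
assert list(df.columns) == ["KASUS"]
```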

Revisions and additions made by Antje Casaretto to the data in 2023:

  • F-T: revision and correction (wherever necessary) of all annotations (books 1-7)
  • G, H, I - disambiguation of case forms in pronouns and nominal forms, if unspecified in Zurich data (books 1-7)
  • L - disambiguation of present stem classes (book 7, and book 1 up to line 21050, verse 01.125.01)
  • M - disambiguation of denominal verbs from primary verbs of the 10th class (books 1-10)
  • N - disambiguation of precative and optative forms wherever possible (books 1-7)
  • Q - new annotations for "Int." (intensives) and "Kaus." (causatives) (books 1-7)

Revisions and additions made by Antje Casaretto to the data in 2024 with support in data modeling and automation by Anna Fischer:

  • F-T: revision and correction (wherever necessary) of all annotations (books 8-10)
  • G,H,I: disambiguation of case and gender forms in nominal and pronominal forms, if unspecified in Zurich data (books 8-10)
  • L, M: disambiguation of present stem classes (books 1-10)
  • N: disambiguation of precative and optative forms wherever possible (books 8-10)
  • P: new annotations for "Gdv." (gerundives)
  • Q: new annotations for "Den." (denominatives) (books 1-10) and further annotations of “Kaus.” (causatives) and “Int.” (intensives) (books 8-10)
  • T: revision of lemmatization (books 1-10)
  • V: update of meanings according to revised lemmatization; minimal revision
  • W: revised annotation of ending -se (“se-Form”) (books 1-10); no systematic revision
  • X: no systematic revision
  • A-U: general revision of formal inconsistencies and typing errors (books 1-10)

Revisions made by Natalie Korobzow and Pascal Coenen to the data in 2024 with computational support by Anna Fischer:

  • Y - "LEMMA_GRASSMANN_ID": new column for references to Grassmann dictionary (books 1-10) and revision of Grassmann references
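Populating the new reference column amounts to a lemma → dictionary-entry lookup. A minimal sketch with invented entry IDs (the real Grassmann entry numbers are not reproduced here):

```python
# Hypothetical lemma -> Grassmann entry-ID table; real IDs differ.
GRASSMANN_IDS = {"agni-": "GRA-0042", "indra-": "GRA-0101"}

rows = [{"LEMMA_ZÜRICH": "agni-"}, {"LEMMA_ZÜRICH": "soma-"}]
for r in rows:
    # Column Y: reference to the Grassmann dictionary; empty string
    # when no entry is known for the lemma.
    r["LEMMA_GRASSMANN_ID"] = GRASSMANN_IDS.get(r["LEMMA_ZÜRICH"], "")
```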