93 datasets found
  1. NIST Statistical Reference Datasets - SRD 140

    • catalog.data.gov
    • datasets.ai
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). NIST Statistical Reference Datasets - SRD 140 [Dataset]. https://catalog.data.gov/dataset/nist-statistical-reference-datasets-srd-140-df30c
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software. Currently datasets and certified values are provided for assessing the accuracy of software for univariate statistics, linear regression, nonlinear regression, and analysis of variance. The collection includes both generated and 'real-world' data of varying levels of difficulty. Generated datasets are designed to challenge specific computations. These include the classic Wampler datasets for testing linear regression algorithms and the Simon & Lesage datasets for testing analysis of variance algorithms. Real-world data include challenging datasets such as the Longley data for linear regression, and more benign datasets such as the Daniel & Wood data for nonlinear regression. Certified values are 'best-available' solutions. The certification procedure is described in the web pages for each statistical method. Datasets are ordered by level of difficulty (lower, average, and higher). Strictly speaking the level of difficulty of a dataset depends on the algorithm. These levels are merely provided as rough guidance for the user. Producing correct results on all datasets of higher difficulty does not imply that your software will pass all datasets of average or even lower difficulty. Similarly, producing correct results for all datasets in this collection does not imply that your software will do the same for your particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software. The Statistical Reference Datasets is also supported by the Standard Reference Data Program.
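    A minimal sketch of how a result from a statistical package might be compared against a certified value. It uses the log relative error, a commonly used measure of the number of matching significant digits; the source description does not prescribe this measure, and the function name and the numbers below are hypothetical, not NIST certified values.

```python
import math


def log_relative_error(computed: float, certified: float) -> float:
    """Approximate number of correct significant digits in `computed`.

    Uses the log relative error, -log10(|computed - certified| / |certified|).
    Returns math.inf when the two values agree exactly.
    """
    if certified == 0:
        raise ValueError("relative error is undefined for a certified value of 0")
    if computed == certified:
        return math.inf
    return -math.log10(abs(computed - certified) / abs(certified))


# Hypothetical numbers for illustration only; these are not NIST certified values.
certified_value = 1.0
software_result = 1.0000001
digits = log_relative_error(software_result, certified_value)
print(f"about {digits:.1f} matching significant digits")
```

    A higher value means closer agreement with the certified value; what counts as acceptable depends on the dataset and the difficulty level.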

  2. Goodreads Book Reviews

    • cseweb.ucsd.edu
    json
    Cite
    UCSD CSE Research Project, Goodreads Book Reviews [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain reviews from the Goodreads book review website, along with a variety of attributes describing the items. Critically, these datasets capture multiple levels of user interaction, ranging from adding a book to a shelf to rating and reading it.

    Metadata includes

    • reviews

    • add-to-shelf, read, review actions

    • book attributes: title, isbn

    • graph of similar books

    Basic Statistics:

    • Items: 1,561,465

    • Users: 808,749

    • Interactions: 225,394,930
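    As a rough illustration of what these counts imply, the sketch below derives the average number of interactions per user and per item, and the density of the user-item matrix. The variable names are illustrative, and the density figure assumes that each interaction is a distinct (user, item) pair, which the description does not state.

```python
# Counts copied from the "Basic Statistics" list above.
items = 1_561_465
users = 808_749
interactions = 225_394_930

interactions_per_user = interactions / users
interactions_per_item = interactions / items

# Share of all possible user-item pairs that appear as an interaction,
# assuming each interaction is a distinct (user, item) pair.
density = interactions / (users * items)

print(f"interactions per user: {interactions_per_user:.1f}")
print(f"interactions per item: {interactions_per_item:.1f}")
print(f"matrix density: {density:.6f}")
```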

  3. Advance, IN Annual Population and Growth Analysis Dataset: A Comprehensive...

    • neilsberg.com
    csv, json
    Updated Jul 30, 2024
    Cite
    Neilsberg Research (2024). Advance, IN Annual Population and Growth Analysis Dataset: A Comprehensive Overview of Population Changes and Yearly Growth Rates in Advance from 2000 to 2023 // 2024 Edition [Dataset]. https://www.neilsberg.com/insights/advance-in-population-by-year/
    Available download formats: csv, json
    Dataset updated
    Jul 30, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    IN, Advance
    Variables measured
    Annual Population Growth Rate, Population Between 2000 and 2023, Annual Population Growth Rate Percent
    Measurement technique
    The data presented in this dataset is derived from more than 20 years of data from the U.S. Census Bureau Population Estimates Program (PEP), 2000 - 2023. To measure the variables, namely (a) population and (b) population change (in absolute terms and as a percentage), we first analyzed and tabulated the data for each year between 2000 and 2023. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the Advance population over the last 20-plus years. It lists the population for each year, along with the year-on-year change in population, as well as the change in percentage terms for each year. The dataset can be used to understand how the population of Advance has changed over the last two decades. For example, using this dataset, we can identify whether the population is declining or increasing and, if it has changed, when it peaked or whether it is still growing and has not yet reached its peak. We can also compare the trend with the overall trend of the United States population over the same period.

    Key observations

    In 2023, the population of Advance was 505, a 0.40% increase year over year from 2022. Previously, in 2022, the population of Advance was 503, a decline of 0.59% compared to a population of 506 in 2021. Over the last 20-plus years, between 2000 and 2023, the population of Advance decreased by 54. In this period, the peak population was 598 in the year 2009. The numbers suggest that the population has already reached its peak and is showing a trend of decline. Source: U.S. Census Bureau Population Estimates Program (PEP).

    Content

    When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

    Data Coverage:

    • From 2000 to 2023

    Variables / Data Columns

    • Year: This column displays the data year (Measured annually and for years 2000 to 2023)
    • Population: The population of Advance for the specific year is shown in this column.
    • Year on Year Change: This column displays the change in Advance population for each year compared to the previous year.
    • Change in Percent: This column displays the year on year change as a percentage. Please note that the sum of all percentages may not equal one due to rounding of values.
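    The two derived columns can be reproduced from the Population column alone. The sketch below is one way to do it, using only the three population figures quoted in the key observations; the function name is illustrative and is not part of the dataset.

```python
def year_on_year(populations: dict) -> list:
    """Return (year, absolute change, percent change) for each year after the first."""
    years = sorted(populations)
    rows = []
    for previous, current in zip(years, years[1:]):
        change = populations[current] - populations[previous]
        percent = 100.0 * change / populations[previous]
        rows.append((current, change, percent))
    return rows


# Populations quoted in the key observations above.
advance = {2021: 506, 2022: 503, 2023: 505}
rows = year_on_year(advance)
for year, change, percent in rows:
    print(f"{year}: {change:+d} ({percent:+.2f}%)")
```

    Rounded to two decimals, the percentages match the 0.59% decline and 0.40% increase reported for 2022 and 2023.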

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Advance Population by Year.

  4. Student Performance Data Set

    • kaggle.com
    Updated Mar 27, 2020
    Cite
    Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
    Dataset updated
    Mar 27, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data-Science Sean
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    If this data set is useful, an upvote is appreciated. This data addresses student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

  5. Optimization and Evaluation Datasets for PiMine - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Jun 7, 2024
    Cite
    (2024). Optimization and Evaluation Datasets for PiMine - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/7f3c2c4e-13c5-511d-8463-47a1185959a8
    Dataset updated
    Jun 7, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The protein-protein interface comparison software PiMine was developed to provide fast comparisons against databases of known protein-protein complex structures. Its application domains range from the prediction of interfaces and potential interaction partners to the identification of potential small molecule modulators of protein-protein interactions [1].

    The protein-protein evaluation datasets are a collection of five datasets that were used for the parameter optimization (ParamOptSet), enrichment assessment (Dimer597 set, Keskin set, PiMineSet), and runtime analyses (RunTimeSet) of protein-protein interface comparison tools. The evaluation datasets contain pairs of interfaces of protein chains that either share sequential and structural similarities or are even sequentially and structurally unrelated. They enable comparative benchmark studies for tools designed to identify interface similarities.

    Data Set description:

    The ParamOptSet was designed based on a study on improving the benchmark datasets for the evaluation of protein-protein docking tools [2]. It was used to optimize and fine-tune the geometric search parameters of PiMine.

    The Dimer597 [3] and Keskin [4] sets were developed earlier. We used them to evaluate PiMine’s performance in identifying structurally and sequentially related interface pairs as well as interface pairs with prominent similarity whose constituting chains are sequentially unrelated.

    The PiMine set [1] was constructed to assess different quality criteria for reliable interface comparison. It consists of similar pairs of protein-protein complexes of which two chains are sequentially and structurally highly related while the other two chains are unrelated and show different folds. It enables the assessment of the performance when the interfaces of apparently unrelated chains are available only. Furthermore, we could obtain reliable interface-interface alignments based on the similar chains which can be used for alignment performance assessments.

    Finally, the RunTimeSet [1] comprises protein-protein complexes from the PDB that were predicted to be biologically relevant. It enables the comparison of typical run times of comparison methods and represents also an interesting dataset to screen for interface similarities.

    References:

    [1] Graef, J.; Ehrt, C.; Reim, T.; Rarey, M. Database-driven identification of structurally similar protein-protein interfaces (submitted)
    [2] Barradas-Bautista, D.; Almajed, A.; Oliva, R.; Kalnis, P.; Cavallo, L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinform. Adv. 2023, 3, vbad012.
    [3] Gao, M.; Skolnick, J. iAlign: a method for the structural comparison of protein–protein interfaces. Bioinformatics 2010, 26, 2259-2265.
    [4] Keskin, O.; Tsai, C.-J.; Wolfson, H.; Nussinov, R. A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications. Protein Sci. 2004, 13, 1043-1055.

    This work was supported by the German Federal Ministry of Education and Research as part of CompLS and de.NBI [031L0172, 031L0105]. C.E. is funded by Data Science in Hamburg – Helmholtz Graduate School for the Structure of Matter (Grant-ID: HIDSS-0002).

  6. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • data.niaid.nih.gov
    Updated Oct 20, 2022
    Cite
    Yfantidou, Sofia (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6826682
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    Efstathiou, Stefanos
    Marchioro, Thomas
    Girdzijauskas, Šarūnas
    Ferrari, Elena
    Karagianni, Christina
    Palotti, Joao
    Vakali, Athena
    Giakatos, Dimitrios Panteleimon
    Yfantidou, Sofia
    Kazlouski, Andrei
    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
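    A minimal sketch of the pandas.read_csv() approach mentioned above. To stay self-contained it reads from an in-memory stand-in rather than a downloaded file; the column names (id, date, steps) and the values are assumptions made for illustration and may not match the actual LifeSnaps CSV files.

```python
import io

import pandas as pd

# Stand-in for one of the provided CSV files; the column names and values
# here are hypothetical and only keep the example self-contained.
sample_csv = io.StringIO(
    "id,date,steps\n"
    "user_a,2021-06-01,8000\n"
    "user_a,2021-06-02,9500\n"
)

# With a downloaded file, pass its path instead of the in-memory buffer.
daily = pd.read_csv(sample_csv, parse_dates=["date"])
print(daily.dtypes)
print(daily.head())
```

    Passing parse_dates converts the listed columns to datetime values on load, which is convenient for data at daily or hourly granularity.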

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    { _id: <MongoDB primary key>, id (or user_id): <user-specific ID>, type: <data type>, data: <embedded object> }

    Each document consists of four fields: _id, id (also found as user_id in the sema and surveys collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document, e.g., the step count for a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object differ between types of data. As mentioned previously, all times are stored in local time, and user IDs are common across different collections. For more information on the available data types, see the related publication.
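    The sketch below shows one way to work with documents of this shape once they have been loaded as Python dictionaries. The helper names, the user IDs, and the fields inside the data object are hypothetical; only the top-level field names (_id, id or user_id, type, data) come from the description above.

```python
from typing import Any, Dict, Optional


def user_id_of(document: Dict[str, Any]) -> Optional[Any]:
    """Return the user identifier, whether it is stored as `id` or `user_id`."""
    if "id" in document:
        return document["id"]
    return document.get("user_id")


def values_of_type(documents, data_type):
    """Yield (user id, data object) for every document of the given type."""
    for document in documents:
        if document.get("type") == data_type:
            yield user_id_of(document), document.get("data", {})


# Hypothetical documents shaped like the format described above; the IDs,
# the field names inside `data`, and the values are invented for illustration.
documents = [
    {"_id": 1, "id": "user_a", "type": "steps", "data": {"dateTime": "2021-06-01T10:00:00", "value": 12}},
    {"_id": 2, "user_id": "user_a", "type": "survey", "data": {"answer": 3}},
]
steps = list(values_of_type(documents, "steps"))
print(steps)
```

    Normalizing id and user_id in one helper keeps later joins across the three collections from depending on which field name a collection happens to use.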

    Surveys Encoding

    BREQ2

    Why do you engage in exercise?

        Code             Text
        engage[SQ001]    I exercise because other people say I should
        engage[SQ002]    I feel guilty when I don’t exercise
        engage[SQ003]    I value the benefits of exercise
        engage[SQ004]    I exercise because it’s fun
        engage[SQ005]    I don’t see why I should have to exercise
        engage[SQ006]    I take part in exercise because my friends/family/partner say I should
        engage[SQ007]    I feel ashamed when I miss an exercise session
        engage[SQ008]    It’s important to me to exercise regularly
        engage[SQ009]    I can’t see why I should bother exercising
        engage[SQ010]    I enjoy my exercise sessions
        engage[SQ011]    I exercise because others will not be pleased with me if I don’t
        engage[SQ012]    I don’t see the point in exercising
        engage[SQ013]    I feel like a failure when I haven’t exercised in a while
        engage[SQ014]    I think it is important to make the effort to exercise regularly
        engage[SQ015]    I find exercise a pleasurable activity
        engage[SQ016]    I feel under pressure from my friends/family to exercise
        engage[SQ017]    I get restless if I don’t exercise regularly
        engage[SQ018]    I get pleasure and satisfaction from participating in exercise
        engage[SQ019]    I think exercising is a waste of time
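    Survey codes such as engage[SQ001] combine a scale prefix with an item number. A small sketch of how such codes might be split and mapped back to their question text follows; the regular expression and function name are assumptions for illustration, and the dictionary holds only a few entries copied from the BREQ2 encoding.

```python
import re

# Matches codes of the form <scale>[SQ<number>], e.g. engage[SQ001].
CODE_PATTERN = re.compile(r"^(?P<scale>\w+)\[SQ(?P<item>\d+)\]$")


def parse_code(code):
    """Split a code such as 'engage[SQ001]' into ('engage', 1)."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognized survey code: {code!r}")
    return match.group("scale"), int(match.group("item"))


# A few entries copied from the BREQ2 encoding above.
breq2_text = {
    "engage[SQ001]": "I exercise because other people say I should",
    "engage[SQ003]": "I value the benefits of exercise",
    "engage[SQ019]": "I think exercising is a waste of time",
}

for code, text in breq2_text.items():
    scale, item = parse_code(code)
    print(f"{scale} item {item}: {text}")
```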
    

    PANAS

    Indicate the extent you have felt this way over the past week

        P1[SQ001]    Interested
        P1[SQ002]    Distressed
        P1[SQ003]    Excited
        P1[SQ004]    Upset
        P1[SQ005]    Strong
        P1[SQ006]    Guilty
        P1[SQ007]    Scared
        P1[SQ008]    Hostile
        P1[SQ009]    Enthusiastic
        P1[SQ010]    Proud
        P1[SQ011]    Irritable
        P1[SQ012]    Alert
        P1[SQ013]    Ashamed
        P1[SQ014]    Inspired
        P1[SQ015]    Nervous
        P1[SQ016]    Determined
        P1[SQ017]    Attentive
        P1[SQ018]    Jittery
        P1[SQ019]    Active
        P1[SQ020]    Afraid
    

    Personality

    How Accurately Can You Describe Yourself?

        Code           Text
        ipip[SQ001]    Am the life of the party.
        ipip[SQ002]    Feel little concern for others.
        ipip[SQ003]    Am always prepared.
        ipip[SQ004]    Get stressed out easily.
        ipip[SQ005]    Have a rich vocabulary.
        ipip[SQ006]    Don't talk a lot.
        ipip[SQ007]    Am interested in people.
        ipip[SQ008]    Leave my belongings around.
        ipip[SQ009]    Am relaxed most of the time.
        ipip[SQ010]    Have difficulty understanding abstract ideas.
        ipip[SQ011]    Feel comfortable around people.
        ipip[SQ012]    Insult people.
        ipip[SQ013]    Pay attention to details.
        ipip[SQ014]    Worry about things.
        ipip[SQ015]    Have a vivid imagination.
        ipip[SQ016]    Keep in the background.
        ipip[SQ017]    Sympathize with others' feelings.
        ipip[SQ018]    Make a mess of things.
        ipip[SQ019]    Seldom feel blue.
        ipip[SQ020]    Am not interested in abstract ideas.
        ipip[SQ021]    Start conversations.
        ipip[SQ022]    Am not interested in other people's problems.
        ipip[SQ023]    Get chores done right away.
        ipip[SQ024]    Am easily disturbed.
        ipip[SQ025]    Have excellent ideas.
        ipip[SQ026]    Have little to say.
        ipip[SQ027]    Have a soft heart.
        ipip[SQ028]    Often forget to put things back in their proper place.
        ipip[SQ029]    Get upset easily.
        ipip[SQ030]    Do not have a good imagination.
        ipip[SQ031]    Talk to a lot of different people at parties.
        ipip[SQ032]    Am not really interested in others.
        ipip[SQ033]    Like order.
        ipip[SQ034]    Change my mood a lot.
        ipip[SQ035]    Am quick to understand things.
        ipip[SQ036]    Don't like to draw attention to myself.
        ipip[SQ037]    Take time out for others.
        ipip[SQ038]    Shirk my duties.
        ipip[SQ039]    Have frequent mood swings.
        ipip[SQ040]    Use difficult words.
        ipip[SQ041]    Don't mind being the centre of attention.
        ipip[SQ042]    Feel others' emotions.
        ipip[SQ043]    Follow a schedule.
        ipip[SQ044]    Get irritated easily.
        ipip[SQ045]    Spend time reflecting on things.
        ipip[SQ046]    Am quiet around strangers.
        ipip[SQ047]    Make people feel at ease.
        ipip[SQ048]    Am exacting in my work.
        ipip[SQ049]    Often feel blue.
        ipip[SQ050]    Am full of ideas.
    

    STAI

    Indicate how you feel right now

        Code           Text
        STAI[SQ001]    I feel calm
        STAI[SQ002]    I feel secure
        STAI[SQ003]    I am tense
        STAI[SQ004]    I feel strained
        STAI[SQ005]    I feel at ease
        STAI[SQ006]    I feel upset
        STAI[SQ007]    I am presently worrying over possible misfortunes
        STAI[SQ008]    I feel satisfied
        STAI[SQ009]    I feel frightened
        STAI[SQ010]    I feel comfortable
        STAI[SQ011]    I feel self-confident
        STAI[SQ012]    I feel nervous
        STAI[SQ013]    I am jittery
        STAI[SQ014]    I feel indecisive
        STAI[SQ015]    I am relaxed
        STAI[SQ016]    I feel content
        STAI[SQ017]    I am worried
        STAI[SQ018]    I feel confused
        STAI[SQ019]    I feel steady
        STAI[SQ020]    I feel pleasant
    

    TTM

    Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?

        Code                Text
        processes[SQ002]    I read articles to learn more about physical
    
  7. Kaggle Datasets - Summary, Topics, Classification

    • kaggle.com
    Updated Nov 16, 2020
    Cite
    Katherine Marsh (2020). Kaggle Datasets - Summary, Topics, Classification [Dataset]. https://www.kaggle.com/katherinemarsh/kaggle-datasets-summary-topics-classification/discussion
    Dataset updated
    Nov 16, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Katherine Marsh
    Description

    Context

    Companies and individuals are storing increasingly more data digitally; however, much of the data is unused because it is unclassified. How many times have you opened your downloads folder, found a file you downloaded a year ago, and had no idea what its contents were? You can read through those files individually, but imagine doing that for thousands of files. All that raw data in storage facilities creates data lakes. As the amount of data grows and the complexity rises, data lakes become data swamps. The potentially valuable and interesting datasets will likely remain unused. Our tool addresses the need to classify these large pools of data in a visually effective and succinct manner by identifying keywords in datasets and classifying datasets into a consistent taxonomy.

    The files listed within kaggleDatasetSummaryTopicsClassification.csv have been processed with our tool to generate the keywords and taxonomic classification as seen below. The summaries are not generated from our system. Instead they were retrieved from user input as they uploaded the files on Kaggle. We planned to utilize these summaries to create an NLG model to generate summaries from any input file. Unfortunately we were not able to collect enough data to build a good model. Hopefully the data within this set might help future users achieve that goal.

    Acknowledgements

    Developed with the Senior Design Center at NC State in collaboration with SAS. Senior Design Team: Tanya Chu, Katherine Marsh, Nikhil Milind, Anna Owens. SAS Representatives: Nancy Rausch, Marty Warner, Brant Kay, Tyler Wendell, JP Trawinski.

  8. ‘Countries of the World’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 12, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Countries of the World’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-countries-of-the-world-00c4/2cca4656/?iid=005-843&v=presentation
    Dataset updated
    Nov 12, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Analysis of ‘Countries of the World’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/fernandol/countries-of-the-world on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    World fact sheet, fun to link with other datasets.

    Content

    Information on population, region, area size, infant mortality and more.

    Acknowledgements

    Source: All these data sets are made up of data from the US government. Generally they are free to use if you use the data in the US. If you are outside of the US, you may need to contact the US Govt to ask. Data from the World Factbook is public domain. The website says "The World Factbook is in the public domain and may be used freely by anyone at anytime without seeking permission."
    https://www.cia.gov/library/publications/the-world-factbook/docs/faqs.html

    Inspiration

    When making visualisations related to countries, sometimes it is interesting to group them by attributes such as region, or weigh their importance by population, GDP or other variables.

    --- Original source retains full ownership of the source dataset ---

  9. A Meta-Analysis of Multiple Matched Copy Number and Transcriptomics Data...

    • plos.figshare.com
    pdf
    Updated Jun 9, 2023
    Cite
    Richard Newton; Lorenz Wernisch (2023). A Meta-Analysis of Multiple Matched Copy Number and Transcriptomics Data Sets for Inferring Gene Regulatory Relationships [Dataset]. http://doi.org/10.1371/journal.pone.0105522
    Available download formats: pdf
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Richard Newton; Lorenz Wernisch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Inferring gene regulatory relationships from observational data is challenging. Manipulation and intervention is often required to unravel causal relationships unambiguously. However, gene copy number changes, as they frequently occur in cancer cells, might be considered natural manipulation experiments on gene expression. An increasing number of data sets on matched array comparative genomic hybridisation and transcriptomics experiments from a variety of cancer pathologies are becoming publicly available. Here we explore the potential of a meta-analysis of thirty such data sets. The aim of our analysis was to assess the potential of in silico inference of trans-acting gene regulatory relationships from this type of data. We found sufficient correlation signal in the data to infer gene regulatory relationships, with interesting similarities between data sets. A number of genes had highly correlated copy number and expression changes in many of the data sets and we present predicted potential trans-acted regulatory relationships for each of these genes. The study also investigates to what extent heterogeneity between cell types and between pathologies determines the number of statistically significant predictions available from a meta-analysis of experiments.

  10. e

    30 years of synoptic observations from Neumayer Station with links to...

    • data.europa.eu
    • doi.pangaea.de
    • +1more
    unknown
    Updated Feb 5, 2022
    Cite
    PANGAEA (2022). 30 years of synoptic observations from Neumayer Station with links to datasets [Dataset]. https://data.europa.eu/data/datasets/de-pangaea-dataset150017?locale=cs
    Explore at:
    Available download formats: unknown
    Dataset updated
    Feb 5, 2022
    Dataset authored and provided by
    PANGAEA
    Description

    The analysis of research data plays a key role in data-driven areas of science. Varieties of mixed research data sets exist and scientists aim to derive or validate hypotheses to find undiscovered knowledge. Many analysis techniques identify relations of an entire dataset only. This may level out the characteristic behavior of different subgroups in the data. Like automatic subspace clustering, we aim at identifying interesting subgroups and attribute sets. We present a visual-interactive system that supports scientists in exploring interesting relations between aggregated bins of multivariate attributes in mixed data sets. The abstraction of data to bins enables the application of statistical dependency tests as the measure of interestingness. An overview matrix view shows all attributes, ranked with respect to the interestingness of bins. In addition, a node-link view reveals multivariate bin relations by positioning dependent bins close to each other. The system supports information drill-down based on both expert knowledge and algorithmic support. Finally, visual-interactive subset clustering assigns multivariate bin relations to groups. A list-based cluster result representation enables the scientist to communicate multivariate findings at a glance. We demonstrate the applicability of the system with two case studies from the earth observation domain and the prostate cancer research domain. In both cases, the system enabled us to identify the most interesting multivariate bin relations, to validate already published results, and, moreover, to discover unexpected relations.

  11. e

    Reference list of 265 sources used for the discovery of relationships...

    • b2find.eudat.eu
    Updated Jul 9, 2012
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Dataset updated
    Jul 9, 2012
    Description

    Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in context of belonging categorical, numerical or textual metadata which are given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked views allow relating (or loosely speaking: correlating) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, especially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster-to-metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus in our work on metadata of categorical type, which can be summarized for a cluster in form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata. Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst in discovering interesting and visually understandable relationships.

    The dataset contains 265 links (children) to any of the BSRN datasets. Any user who accepts the BSRN data release guidelines (http://bsrn.awi.de/data/conditions-of-data-release) may ask Gert König-Langlo (mailto:Gert.Koenig-Langlo@awi.de) to obtain an account to download these datasets.

  12. Yelp COVID-19 Features

    • kaggle.com
    Updated Jun 23, 2021
    Cite
    Alexis Han (2021). Yelp COVID-19 Features [Dataset]. https://www.kaggle.com/alexzixinhan/yelp-covid19-features
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 23, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alexis Han
    Description

    Source

    Please refer to Yelp for the original JSON file and other datasets. This dataset was created in June 2020 by Yelp. The usage of this dataset should be for academic purposes.

    Content

    I read the JSON file in Python and convert it to three CSV files:

    • covid_features.csv contains all the data (9 features)
    • covid_features_banners.csv contains all records that have covid banners, which is good for text analysis and creating graphs like a word cloud
    • covid_features_highlights.csv contains all records that have highlights (not FALSE); each highlights value is structured like a dictionary and carries additional data.
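
    The conversion described above might look roughly like the following sketch. It assumes the original file holds one JSON object per line and that absent values are stored as the string "FALSE"; the field names "Covid Banner" and "highlights" and the sample records are hypothetical stand-ins, not taken from the Yelp file.

```python
import csv
import io
import json

# Hypothetical sample in JSON Lines form; the field names are assumptions,
# not copied from the original Yelp file.
SAMPLE = "\n".join([
    json.dumps({"business_id": "b1", "Covid Banner": "FALSE", "highlights": "FALSE"}),
    json.dumps({"business_id": "b2", "Covid Banner": "Open for takeout", "highlights": "FALSE"}),
    json.dumps({"business_id": "b3", "Covid Banner": "FALSE",
                "highlights": '[{"identifier": "curbside_pickup"}]'}),
])


def load_records(stream):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


def to_csv(records, fieldnames):
    """Serialize records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


records = load_records(io.StringIO(SAMPLE))
fields = list(records[0].keys())

# Three outputs: everything, records with a banner, records with highlights.
all_csv = to_csv(records, fields)
banners_csv = to_csv([r for r in records if r["Covid Banner"] != "FALSE"], fields)
highlights_csv = to_csv([r for r in records if r["highlights"] != "FALSE"], fields)
```

    To write real files, the same functions could be pointed at an opened JSON file, with the returned CSV text saved to disk.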

    Acknowledgements

    Please read Dataset_User_Agreement.pdf before you proceed with all data files.

    Inspiration

    It would be interesting to see how virtual services were offered by restaurants during COVID in 2020 and how restaurant businesses strived to communicate and connect with customers on Yelp. There is no numeric data to play with; however, it's still valuable to do some visualizations.

  13. Twitter Tweets Sentiment Dataset

    • kaggle.com
    Updated Apr 8, 2022
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 8, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description:

    Twitter is an online social media platform where people share their thoughts as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to distinguish the negative tweets and block them. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet
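
    The parsing advice above can be sketched as follows. The sample rows are hypothetical; only the column names (textID, text, selected_text, sentiment) come from the description. Python's csv module already removes the CSV-level quoting, so the helper only strips a leftover pair of literal quotes wrapping the whole field.

```python
import csv
import io

# Hypothetical rows; only the column names come from the dataset description.
RAW = (
    'textID,text,selected_text,sentiment\n'
    'a1,"""I love this!""","love this",positive\n'
    'a2,"so sad, really","so sad",negative\n'
)


def strip_outer_quotes(value):
    """Remove one pair of matching quotes wrapping the whole field, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]
    return value


rows = []
for row in csv.DictReader(io.StringIO(RAW)):
    # Clean both text columns so no quote characters leak into training data.
    row["text"] = strip_outer_quotes(row["text"])
    row["selected_text"] = strip_outer_quotes(row["selected_text"])
    rows.append(row)
```

    Whitespace is deliberately left untouched here, since the selected span is meant to include all characters of the original text.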

    Acknowledgement:

    The dataset was downloaded from Kaggle Competitions:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build classification models to predict the twitter sentiments.
    • Compare the evaluation metrics of various classification algorithms.
  14. A

    ‘US Figure Skating Data - 2020 Regionals Int Ladies’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘US Figure Skating Data - 2020 Regionals Int Ladies’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-us-figure-skating-data-2020-regionals-int-ladies-de39/8600c042/?iid=013-632&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Analysis of ‘US Figure Skating Data - 2020 Regionals Int Ladies’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/katiewerner/us-figure-skating-data-2020-regionals-int-ladies on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    US Figure Skating hosts several competitions at the regional level; top skaters from these competitions progress to the next echelon of competition. Depending on the region, there may be qualifying events before the final round to determine the top skaters. In a qualifying round, skaters perform their free skate program for a panel of judges; top qualifiers move on to the final round. The final round consists of each skater performing a short program and their free skate program; the official panels are not necessarily the same between the short program and free skate program events (likewise for qualifying rounds).

    Skaters receive scores based on: technical elements, program components, and any deductions incurred. Technical elements performed are assessed by the technical officials on the panel of officials, and the element is assigned a base value. This base value has several considerations beyond just the element performed - for example, if a jump is under-rotated this affects the base value, and if the skater performs an element that receives a bonus, the bonus is added to the base value. The judges on the panel of officials then assign each technical element a grade of execution mark, which may be integers in the range from -5 to +5. The base value is then scaled by the average GOE, dropping the highest and lowest GOEs.
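
    As a rough illustration of the trimming step described above, the sketch below averages the GOE marks after dropping the highest and lowest. How the average GOE scales the base value is not specified here, so the 10%-per-GOE-point factor (`step`) is an assumption made purely for illustration, and the function names are hypothetical.

```python
def trimmed_mean(marks):
    """Average the marks after dropping one highest and one lowest value."""
    if len(marks) < 3:
        raise ValueError("need at least three marks to drop the high and low")
    kept = sorted(marks)[1:-1]
    return sum(kept) / len(kept)


def element_score(base_value, goes, step=0.10):
    """Adjust base_value by the trimmed-mean GOE.

    ASSUMPTION: each GOE point is worth `step` (10%) of the base value.
    The dataset description only says the base value is "scaled" by the
    average GOE, so this factor is illustrative, not official.
    """
    for goe in goes:
        if goe not in range(-5, 6):
            raise ValueError("GOE marks must be integers from -5 to +5")
    return base_value * (1 + step * trimmed_mean(goes))


# Five judges: the 1 and the 5 are dropped, leaving [2, 2, 3].
example = element_score(4.0, [2, 3, 1, 2, 5])
```

    Dropping the extremes keeps a single unusually harsh or generous mark from moving the element score.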

    Judges on the official panel are also responsible for assigning program component scores, which consider interpretation of the music, program choreography/composition, performance, transitions between and going into the technical elements, and overall skating skills. Some skating levels have scores for some of the components. Component marks are adjusted based on removing high and low scores, and also adjusted by a set of scalar multipliers, as provided by US Figure Skating.

    Deductions may be incurred for time violations, falls, or costume/prop issues. Fall deductions may be more severe depending on the level.

    Content

    This dataset packaging includes 5 csv files that contain the scraped data from the 2020 Regional Intermediate Ladies events:

    • Event Details (event index, round, program type)
    • Official Details (judge and technical official names, cities)
    • Skater Details (skater placement, skate order, club, overall scores and deductions)
    • Technical Score Details (base values, GOEs)
    • Component Score Details (component scores)

    The event index and skater placement columns are keys that can be used to join these datasets together, as appropriate.
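
    A join on those two keys could be sketched as below, using plain Python so that no extra libraries are needed. The column names (`event_index`, `placement`) and the sample rows are hypothetical; the actual CSV headers may differ.

```python
# Hypothetical rows and column names; the real CSV headers may differ.
skaters = [
    {"event_index": 1, "placement": 1, "club": "Club A", "total_score": 45.2},
    {"event_index": 1, "placement": 2, "club": "Club B", "total_score": 43.9},
]
components = [
    {"event_index": 1, "placement": 1, "component": "Skating Skills", "score": 3.5},
    {"event_index": 1, "placement": 1, "component": "Performance", "score": 3.25},
    {"event_index": 1, "placement": 2, "component": "Skating Skills", "score": 3.0},
]


def inner_join(left, right, keys):
    """Return merged rows for every left/right pair that agrees on all keys."""
    index = {}
    for row in left:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in right:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**match, **row})
    return joined


# Attach each component score to the skater it belongs to.
joined = inner_join(skaters, components, keys=("event_index", "placement"))
```

    Both keys are needed together, because a placement number alone repeats across events.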

    Acknowledgements

    Many thanks to my husband, for encouraging me to tackle this project during my personal sabbatical. Special shoutout to James Robertson and Joshua Merry for giving this dataset publication a quick once over.

    As ever, I also owe thanks to stackoverflow for inspiring code solutions when all hope is seemingly lost.

    Photo credit: Greg Pembroke

    Inspiration

    I created this dataset as simply a fun data project, being data I am intimately familiar and fascinated with the creation of -- indeed, I helped create this data back in October 2019 by serving as a judge on some of these panels! I intend to publish another notebook containing EDA and some models in the near future - stay tuned!

    The scripts and notebooks provided are strictly intended to serve as an example to demonstrate my approach to working with data. It is ABSOLUTELY NOT intended to be used to draw any definitive conclusions about: the sport of figure skating, the judges, the technical officials, the skaters, the coaches, the skating clubs, US Figure Skating, any region or section of US Figure Skating, etc.

    --- Original source retains full ownership of the source dataset ---

  15. SRM Resting-state EEG

    • openneuro.org
    Updated Nov 23, 2022
    + more versions
    Cite
    Christoffer Hatlestad-Hall; Trine Waage Rygvold; Stein Andersson (2022). SRM Resting-state EEG [Dataset]. http://doi.org/10.18112/openneuro.ds003775.v1.2.1
    Explore at:
    Dataset updated
    Nov 23, 2022
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Christoffer Hatlestad-Hall; Trine Waage Rygvold; Stein Andersson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    SRM Resting-state EEG

    Introduction

    This EEG dataset contains resting-state EEG extracted from the experimental paradigm used in the Stimulus-Selective Response Modulation (SRM) project at the Dept. of Psychology, University of Oslo, Norway.

    The data is recorded with a BioSemi ActiveTwo system, using 64 electrodes following the positional scheme of the extended 10-20 system (10-10). Each datafile comprises four minutes of uninterrupted EEG acquired while the subjects were resting with their eyes closed. The dataset includes EEG from 111 healthy control subjects (the "t1" session), of which a number underwent an additional EEG recording at a later date (the "t2" session). Thus, some subjects have one associated EEG file, whereas others have two.

    Disclaimer

    The dataset is provided "as is". Hereunder, the authors take no responsibility with regard to data quality. The user is solely responsible for ascertaining that the data used for publications or in other contexts fulfil the required quality criteria.

    The data

    Raw data files

    The raw EEG data signals are rereferenced to the average reference. Other than that, no operations have been performed on the data. The files contain no events; the whole continuous segment is resting-state data. The data signals are unfiltered (the data were recorded in Europe, so the line noise frequency is 50 Hz). The time points for each subject's EEG recording(s) are listed in the *_scans.tsv file (particularly interesting for the subjects with two recordings).

    Please note that the quality of the raw data has not been carefully assessed. While most data files are of high quality, a few might be of poorer quality. The data files are provided "as is", and it is the user's responsibility to ascertain the quality of the individual data file.

    /derivatives/cleaned_data

    For convenience, a cleaned dataset is provided. The files in this derived dataset have been preprocessed with a basic, fully automated pipeline (see /code/s2_preprocess.m for details). The derived files are stored as EEGLAB .set files in a directory structure identical to that of the raw files. Please note that the *_channels.tsv files associated with the derived files have been updated with status information about each channel ("good" or "bad"). The "bad" channels are, for the sake of consistency, interpolated and thus still present in the data. It might be advisable to remove these channels in some analyses, as they (by definition) add no new information to the EEG data. The cleaned data signals are referenced to the average reference (including the interpolated channels).
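
    One way to find the channels to exclude is to read the status column of a *_channels.tsv file, as in the sketch below. The column layout (name, type, units, status) follows the usual BIDS convention and is an assumption here, as are the sample rows.

```python
import csv
import io

# Hypothetical excerpt of a *_channels.tsv file; the columns follow the
# BIDS convention (name, type, units, status), which is assumed here.
CHANNELS_TSV = (
    "name\ttype\tunits\tstatus\n"
    "Fp1\tEEG\tuV\tgood\n"
    "AF7\tEEG\tuV\tbad\n"
    "Cz\tEEG\tuV\tgood\n"
)


def bad_channels(tsv_stream):
    """Return the names of channels whose status is marked "bad"."""
    reader = csv.DictReader(tsv_stream, delimiter="\t")
    return [row["name"] for row in reader if row["status"].strip().lower() == "bad"]


bad = bad_channels(io.StringIO(CHANNELS_TSV))
```

    The resulting list can then be handed to whichever EEG toolbox is in use to drop the interpolated channels before analysis.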

    Please mind the automatic nature of the employed pipeline. It might not perform optimally on all data files (e.g. over-/underestimating proportion of bad channels). For publications, we recommend implementing a more sensitive cleaning pipeline.

    Demographic and cognitive test data

    The participants.tsv file in the root folder contains the variables age, sex, and a range of cognitive test scores. See the sidecar participants.json for more information on the behavioural measures. Please note that these measures were collected in connection with the "t1" session recording.

    How to cite

    All use of this dataset in a publication context requires the following paper to be cited:

    Hatlestad-Hall, C., Rygvold, T. W., & Andersson, S. (2022). BIDS-structured resting-state electroencephalography (EEG) data extracted from an experimental paradigm. Data in Brief, 45, 108647. https://doi.org/10.1016/j.dib.2022.108647

    Contact

    Questions regarding the EEG data may be addressed to Christoffer Hatlestad-Hall (chr.hh@pm.me).

    Questions regarding the project in general may be addressed to Stein Andersson (stein.andersson@psykologi.uio.no) or Trine W. Rygvold (t.w.rygvold@psykologi.uio.no).

  16. f

    Data from: Chemical Topic Modeling: Exploring Molecular Data Sets Using a...

    • acs.figshare.com
    • figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl (2023). Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach [Dataset]. http://doi.org/10.1021/acs.jcim.7b00249.s003
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is not different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called “topic modeling” from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to “chemical topics” and investigating the relationships between those. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like “proteins”, “DNA”, or “steroids”. 
    Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.

  17. A

    ‘Premier League’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 12, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Premier League’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-premier-league-37de/8118be0f/?iid=073-634&v=presentation
    Explore at:
    Dataset updated
    Nov 12, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Premier League’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zaeemnalla/premier-league on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    Official football data organised and formatted in csv files ready for download is quite hard to come by. Stats providers are hesitant to release their data to anyone and everyone, even if it's for academic purposes. That was my exact dilemma which prompted me to scrape and extract it myself. Now that it's at your disposal, have fun with it.

    Content

    The data was acquired from the Premier League website and is representative of seasons 2006/2007 to 2017/2018. Visit both sets to get a detailed description of what each entails.

    Inspiration

    Use it to the best of your ability to predict match outcomes or for a thorough data analysis to uncover some intriguing insights. Be safe and only use this dataset for personal projects. If you'd like to use this type of data for a commercial project, contact Opta to access it through their API instead.

    --- Original source retains full ownership of the source dataset ---

  18. u

    Data from: Plant Expression Database

    • agdatacommons.nal.usda.gov
    • s.cnmilf.com
    • +2more
    bin
    Updated Feb 9, 2024
    + more versions
    Cite
    Sudhansu S. Dash; John Van Hemert; Lu Hong; Roger P. Wise; Julie A. Dickerson (2024). Plant Expression Database [Dataset]. https://agdatacommons.nal.usda.gov/articles/dataset/Plant_Expression_Database/24661179
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    PLEXdb
    Authors
    Sudhansu S. Dash; John Van Hemert; Lu Hong; Roger P. Wise; Julie A. Dickerson
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    [NOTE: PLEXdb is no longer available online. Oct 2019.] PLEXdb (Plant Expression Database) is a unified gene expression resource for plants and plant pathogens. PLEXdb is a genotype to phenotype, hypothesis building information warehouse, leveraging highly parallel expression data with seamless portals to related genetic, physical, and pathway data. PLEXdb (http://www.plexdb.org), in partnership with community databases, supports comparisons of gene expression across multiple plant and pathogen species, promoting individuals and/or consortia to upload genome-scale data sets to contrast them to previously archived data. These analyses facilitate the interpretation of structure, function and regulation of genes in economically important plants. A list of Gene Atlas experiments highlights data sets that give responses across different developmental stages, conditions and tissues. Tools at PLEXdb allow users to perform complex analyses quickly and easily. The Model Genome Interrogator (MGI) tool supports mapping gene lists onto corresponding genes from model plant organisms, including rice and Arabidopsis. MGI predicts homologies, displays gene structures and supporting information for annotated genes and full-length cDNAs. The gene list-processing wizard guides users through PLEXdb functions for creating, analyzing, annotating and managing gene lists. Users can upload their own lists or create them from the output of PLEXdb tools, and then apply diverse higher level analyses, such as ANOVA and clustering. PLEXdb also provides methods for users to track how gene expression changes across many different experiments using the Gene OscilloScope. This tool can identify interesting expression patterns, such as up-regulation under diverse conditions or checking any gene’s suitability as a steady-state control. Resources in this dataset:Resource Title: Website Pointer for Plant Expression Database, Iowa State University. 
    File Name: Web Page, url: https://www.bcb.iastate.edu/plant-expression-database [NOTE: PLEXdb is no longer available online. Oct 2019.] Project description for the Plant Expression Database (PLEXdb) and integrated tools.

  19. h

    awesome-chatgpt-prompts

    • huggingface.co
    Updated Dec 15, 2023
    + more versions
    Cite
    Fatih Kadir Akın (2023). awesome-chatgpt-prompts [Dataset]. https://huggingface.co/datasets/fka/awesome-chatgpt-prompts
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2023
    Authors
    Fatih Kadir Akın
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    🧠 Awesome ChatGPT Prompts [CSV dataset]

    This is a dataset repository of Awesome ChatGPT Prompts. View all prompts on GitHub.

    License

    CC-0

  20. E

    A Dataset of Scientific Topics

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    Updated Apr 23, 2022
    Cite
    (2022). A Dataset of Scientific Topics [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/18328
    Explore at:
    Dataset updated
    Apr 23, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The automatic extraction of topics is a standard technique for summarizing text corpora from various domains (e.g., news articles, transport or logistic reports, scientific publications) that has several applications. Since, in many cases, topics are subject to continuous change there is the need to monitor the evolution of a set of topics of interest, as the corresponding corpora are updated. The evolution of scientific topics, in particular, is of great interest for researchers, policy makers, fund managers, and other professionals/engineers in the research and academic community. In this dataset, we provide a set of topics for scientific publications gathered from Crossref. The topics have been produced by performing a topic modeling analysis on two distinct sets of publications, each coming from a different time period. Acknowledgements: This research was partially funded by project ENIRISST under grant agreement No. MIS 5027930 (co-financed by Greece and the EU through the European Regional Development Fund).
