Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Details of experiments are given in the paper, titled 'Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks.': https://doi.org/10.1371/journal.pone.0262056
For additional details, please see https://sites.google.com/view/supercomplex/super-complex-v3-0
Supporting code is available on github at: https://github.com/marcottelab/super.complex
Details of files provided for each experiment are given below:
Toy network experiment
Input data:
Toy network, available as a weighted edge list. Format: node1 node2 edge-weight
All raw toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.
Intermediate output results:
Training toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.
Testing toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.
Training toy communities feature matrix, available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative community. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative community, indicated by 1 or 0 respectively)
Testing toy communities feature matrix, available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative community. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative community, indicated by 1 or 0 respectively)
Output results:
Trained toy community fitness function, available as pickled files of a data pre-processor and a machine learning model from sklearn, both of which can be loaded in Python with the pickle module (see the sketch after this list), for example: import pickle; model = pickle.load(open(filename, 'rb'))
Learned toy communities, available as node lists. Format: node1 node2 node3 .. nodeN Score. Each line represents a community. The score is the community fitness function of the community.
Learned toy communities, available as edge lists. Format: node1 node2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one community from another community's edges.
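As a minimal sketch (the file names below are placeholders, and the scoring step assumes the sklearn model exposes predict_proba), the two pickled objects can be loaded and applied like this:
```python
import pickle

# Placeholder file names; use the actual pickle files shipped with the experiment.
with open("preprocessor.pkl", "rb") as f:
    preprocessor = pickle.load(f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Given a feature matrix X whose rows follow the format described above,
# the community fitness function would be applied roughly as:
# X_t = preprocessor.transform(X)
# scores = model.predict_proba(X_t)[:, 1]
```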
hu.MAP experiment:
Input data:
hu.MAP PPI (protein-protein interaction) network, available as a weighted edge list. Format: gene_ID1 gene_ID2 edge-weight
All raw human protein complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.
Intermediate output results:
Training complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.
Testing complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.
Training data, i.e. feature matrix of CORUM complexes (with edge weights from hu.MAP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)
Testing data, i.e. feature matrix of CORUM complexes (with edge weights from hu.MAP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)
Output results:
Trained community fitness function of CORUM complexes (with edge weights from hu.MAP), available as pickled files of a data pre-processor and a machine learning model from sklearn, both of which can be loaded in Python with the pickle module, for example: import pickle; model = pickle.load(open(filename, 'rb'))
Learned protein complexes from hu.MAP PPI network, available as node lists. Format: Excel file, where the columns are - Learned complex name (named as the most similar CORUM complex, prepended by the Jaccard coefficient similarity), Proteins in learned complex (gene names, i.e. gene_name1 gene_name2 gene_name3 .. gene_nameN), Proteins in learned complex (gene IDs, i.e. gene_ID1 gene_ID2 gene_ID3 .. gene_IDN) and Score (community fitness function of the learned protein complex)
Learned protein complexes from hu.MAP PPI network, available as gene ID edge lists. Format: gene_ID1 gene_ID2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.
Learned protein complexes from hu.MAP PPI network, available as gene name edge lists. Format: gene_name1 gene_name2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.
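A small sketch of reading these blank-line-separated edge lists into per-complex lists of weighted edges (the file name is a placeholder):
```python
# Read a blank-line-separated edge-list file into a list of complexes,
# where each complex is a list of (node1, node2, weight) tuples.
def read_complex_edge_lists(path):
    complexes, current = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line ends the current complex
                if current:
                    complexes.append(current)
                    current = []
                continue
            n1, n2, w = line.split()
            current.append((n1, n2, float(w)))
    if current:
        complexes.append(current)
    return complexes

complexes = read_complex_edge_lists("learned_complexes_edge_list.txt")
```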
Yeast experiments:
Input data:
DIP yeast PPI network, available as a weighted edge list. Format: gene_ID1 gene_ID2 edge-weight
Yeast protein complexes from MIPS, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.
Yeast protein complexes from TAP-MS, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.
Experiment 1: Training on TAP-MS and Testing on MIPS:
Experiment 1 Intermediate output results:
Training data, i.e. feature matrix of TAP-MS complexes (with edge weights from DIP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)
Testing data, i.e. feature matrix of MIPS complexes (with edge weights from DIP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)
Experiment 1 output results:
Trained community fitness function of TAP-MS complexes (with edge weights from DIP PPI network), available as pickled files of a data pre-processor and a machine learning model from sklearn, both of which can be loaded in Python with the pickle module, for example: import pickle; model = pickle.load(open(filename, 'rb'))
Learned protein complexes from DIP PPI network, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. gene_IDN Score. Each line represents a protein complex. The score is the community fitness function of the protein complex.
Learned protein complexes from DIP PPI network, available as edge lists. Format: gene_ID1 gene_ID2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.
Experiment 2: Training on MIPS and Testing on TAP-MS:
Experiment 2 Intermediate output results:
Training data, i.e. feature matrix of MIPS complexes (with edge weights from DIP PPI network), available as a spreadsheet of rows of feature vectors, each corresponding to a positive or negative protein complex. Format: density, number of nodes, degree statistics- maximum, mean, median and variance, clustering coefficient (CC) statistics- maximum, mean and variance, edge weight statistics- mean, maximum and variance, degree correlation (DC) statistics - mean, variance and maximum, 3 Singular values of the subgraph's adjacency matrix, Label (positive or negative protein complex, indicated by 1 or 0 respectively)
From https://openpsychometrics.org/tests/FSIQ/
This dataset contains the results collected from a test that measures IQ with an overall score as well as scores for component abilities.
The Intelligence Quotient (IQ) is a measure of human cognitive ability. Scores are set so that the average is 100. There is controversy about how IQ scores should be broken down; this test uses the 3 domains from Hampshire, Highfield, Parkin and Owen (2012): (1) Short-Term Memory, (2) Reasoning, and (3) Verbal. This model seems to fit best on internet populations.
While a lot of work has been put into making sure this test has good measurement properties, it is not a replacement for a real IQ test; no on-line test is. A main reason is that no one in the on-line context has the attention span for a reliable assessment. The Wechsler Adult Intelligence Scale (a reasonable candidate for the title of 'gold standard IQ test') takes more than an hour [1]. This test was designed with a cap of 20 minutes, because engagement drops off fast after that. It should be understood as a demonstration of how an IQ test can work, rather than a score to base decisions on.
Procedure
This test has six sections, each with its own instructions. It is meant to be taken solo, without references or materials. It will only be valid the first time the taker takes it, so takers who want an accurate result must not start the test until they are ready. The average person takes between 10 and 15 minutes to finish the test.
This experimental IQ test has 25 items, each a matrix with one tile missing and eight possibilities for that tile.
The items are in the /questions/ folder, in subfolders **1...25**, which correspond to **Q1...Q25** in the data file. In the folder for each question you will find the files:
q.png - the incomplete matrix
1.png ... 7.png - wrong answers
a.png - the right answer
In the data file, if Q1 is 4, that means they chose 4.png as their answer. If it is 10, that means they chose a.png as their answer.
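A minimal sketch of working with this encoding (the data file name is a placeholder, and it assumes the item responses are stored in columns Q1...Q25 as described above):
```python
import pandas as pd

# Placeholder file name for the response data.
df = pd.read_csv("data.csv")

item_cols = [f"Q{i}" for i in range(1, 26)]
# A response code of 10 means the taker chose a.png, i.e. the right answer.
df["num_correct"] = (df[item_cols] == 10).sum(axis=1)
```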
At the end they were also directed to indicate:
**gender** - chosen from a drop down list where 1=male, 2=female, 3=other
**age** - entered as free response (ages < 18 removed)
NOTES: 1. The possible answers were presented in two rows of four with a random order for each participant. 2. The collection of this data was of mediocre quality.
So, can we train a model to defeat this test like a human?
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset can be used to train and test the BRITTANY tool. The information contained in the dataset is especially suitable as training and test data for neural network-based classifiers.
This dataset contains 198 Rosbag files of 5 seconds duration, recorded in different locations (kitchen, livingroom-window and livingroom-door) with the Orbi-One robot standing still. Two sorts of Rosbag files have been recorded. In 90 Rosbag files (train*.bag), the recorded data correspond to a person walking in a straight line in front of the robot. Data from five different people have been recorded; for each location and person, six Rosbag files have been recorded.
In 108 Rosbag files (test*.bag), the recorded data also correspond to a person walking in a straight line in front of the robot. Data from six different people have been recorded. Five of those six people are the same as in the other Rosbags; the remaining person is not registered in the system, in order to evaluate false-positive cases.
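A minimal sketch of opening one of the Rosbag files with the ROS1 Python bag API (the file name is a placeholder, and the recorded topics depend on the Orbi-One setup):
```python
import rosbag  # ROS1 Python bag API

# Placeholder file name; any of the train*.bag / test*.bag files can be inspected this way.
with rosbag.Bag("train_example.bag") as bag:
    print(bag.get_type_and_topic_info().topics.keys())  # topics recorded in the bag
    for topic, msg, t in bag.read_messages():
        pass  # extract the fields needed for the classifier from each message
```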
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model performance results based on random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking, with EMBARC data as the training set and APAT data as the testing set, after multiple imputation repeated 10 times.
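As a rough, hedged illustration of the stacked-ensemble setup described above (not the authors' exact configuration; dummy data stands in for one imputed EMBARC/APAT dataset pair):
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Dummy stand-ins for one imputed EMBARC (training) and APAT (testing) dataset.
X_train, y_train = make_classification(n_samples=200, n_features=20, random_state=0)
X_test, y_test = make_classification(n_samples=100, n_features=20, random_state=1)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()), ("gb", GradientBoostingClassifier())],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
test_probs = stack.predict_proba(X_test)[:, 1]  # repeat over the 10 imputed datasets and pool results
```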
http://rightsstatements.org/vocab/InC/1.0/
This dataset comprises a set of information cascades generated by Singapore Twitter users. Here a cascade is defined as a set of tweets about the same topic. The dataset was collected via the Twitter REST and streaming APIs in the following way. Starting from popular seed users (i.e., users having many followers), we crawled their follow, retweet, and user mention links. We then added those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. With this, we have a total of 184,794 Twitter user accounts. Tweets were then crawled from these users from 1 April to 31 August 2012; in all, we got 32,479,134 tweets. To identify cascades, we extracted all the URL links and hashtags from the above tweets, and these URL links and hashtags are considered the identities of cascades. In other words, all the tweets which contain the same URL link (or the same hashtag) represent a cascade. Mathematically, a cascade is represented as a set of user-timestamp pairs. Figure 1 provides an example, i.e. cascade C = {<u1, t1>, <u2, t2>, <u1, t3>, <u3, t4>, <u4, t5>}. For evaluation, the dataset was split into two parts: four months of data for training and the last month of data for testing. Table 1 summarizes the basic (count) statistics of the dataset. Each line in each file represents a cascade. The first term in each line is a hashtag or URL, and the second term is a list of user-timestamp pairs. Due to privacy concerns, all user identities are anonymized.
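A sketch of reading one cascade file under assumed delimiters (a tab between the cascade identity and the pair list, and comma-separated user,timestamp pairs separated by spaces); the actual files may use different separators:
```python
# Assumed line format: <hashtag-or-URL>\t<user1,timestamp1> <user2,timestamp2> ...
def read_cascades(path):
    cascades = {}
    with open(path) as f:
        for line in f:
            identity, pair_str = line.rstrip("\n").split("\t", 1)
            cascades[identity] = [
                (user, int(ts))
                for user, ts in (pair.split(",") for pair in pair_str.split())
            ]
    return cascades
```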
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Motivation. Existing large language models designed to predict antibody structure and function have been trained exclusively with unpaired antibody sequences. This is a substantial drawback, as each antibody represents a unique pairing of heavy and light chains that both contribute to antigen recognition. The cost of generating large datasets of natively paired antibody sequences is orders of magnitude higher than the cost of unpaired sequences, and the paucity of available paired antibody sequence datasets precludes training a state-of-the-art language model using only paired training data. Here, we sought to determine whether and to what extent natively paired training data improves model performance.
Results. Using a unique and recently reported dataset of approximately 1.6 × 10^6 natively paired human antibody sequences, we trained two baseline antibody language model (BALM) variants: BALM-paired and BALM-unpaired. We quantify the superiority of BALM-paired over BALM-unpaired, and we show that BALM-paired's improved performance can be attributed at least in part to its ability to learn cross-chain features that span natively paired heavy and light chains. Additionally, we fine-tuned the general protein language model ESM-2 using these paired antibody sequences and report that the fine-tuned model, but not base ESM-2, demonstrates a similar understanding of cross-chain features.
Files. The following files are included in this repository:
Code: All code used for model training, testing, and figure generation is available under the MIT license on GitHub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Current theories of numerical cognition posit that uniquely human symbolic number abilities connect to an early developing cognitive system for representing approximate numerical magnitudes, the approximate number system (ANS). In support of this proposal, recent laboratory-based training experiments with U.S. children show enhanced performance on symbolic addition after brief practice comparing or adding arrays of dots without counting: tasks that engage the ANS. Here we explore the nature and generality of this effect through two brief training experiments. In Experiment 1, elementary school children in Pakistan practiced either a non-symbolic numerical addition task or a line-length addition task with no numerical content, and then were tested on symbolic addition. After training, children in the numerical training group completed the symbolic addition test faster than children in the line length training group, suggesting a causal role of brief, non-symbolic numerical training on exact, symbolic addition. These findings replicate and extend the core findings of a recent U.S. laboratory-based study to non-Western children tested in a school setting, attesting to the robustness and generalizability of the observed training effects. Experiment 2 tested whether ANS training would also enhance the consistency of performance on a symbolic number line task. Over several analyses of the data there was some evidence that approximate number training enhanced symbolic number line placements relative to control conditions. Together, the findings suggest that engagement of the ANS through brief training procedures enhances children's immediate attention to number and engagement with symbolic number tasks.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Comedy Scraps Corpus is a comprehensive collection of stand-up comedy scripts meticulously compiled from performances held at "The Loft" comedy club from 2020 to 2022. This corpus serves as a treasure trove for natural language processing (NLP) enthusiasts, researchers, and comedy fans alike, offering a unique lens into the evolution of comedic content over time.
Spanning two years, the Comedy Scraps Corpus encapsulates the diverse comedic styles, themes, and linguistic nuances employed by many talented comedians who graced the stage at The Loft. From witty one-liners to elaborate storytelling, this corpus captures the essence of comedic expression in its rawest form.
One of the distinctive features of the Comedy Scraps Corpus is its chronological arrangement, allowing researchers and analysts to trace the evolution of comedic trends, topics, and language usage over the designated period. By examining scripts from different years, users can discern the subtle shifts in comedic sensibilities, audience preferences, and cultural influences that have shaped the comedy landscape during this time frame.
The corpus comprises various comedic material, ranging from observational humor to political satire, from surrealistic narratives to cultural commentary. Each script is meticulously annotated, providing valuable metadata such as performer name, performance date, and audience response, facilitating in-depth analyses and comparative studies.
Moreover, the Comedy Scraps Corpus is a valuable resource for training and testing NLP models, enabling researchers to develop sophisticated algorithms for tasks such as joke generation, sentiment analysis, and humor recognition. By leveraging the rich and varied content within the corpus, developers can explore innovative approaches to computational humor and language understanding.
In summary, the Comedy Scraps Corpus is a testament to stand-up comedy's vibrancy and diversity, offering a comprehensive glimpse into the evolution of comedic discourse from 2020 to 2022. Whether for academic research, algorithmic development, or simply for the joy of comedic exploration, this corpus provides an invaluable resource for anyone interested in the intersection of language, humor, and culture.
polyOne Data Set
The data set contains 100 million hypothetical polymers, each with 29 properties predicted using machine learning models. We use PSMILES strings to represent polymer structures. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with `polyOne_*.parquet`.
I recommend using dask (`pip install dask`) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask
```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```
For example, compute descriptive statistics for the data set:
```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
This dataset contains the initial schedules used for the training and testing processes in the reference paper. In addition, the results summarization archives are available.
The dataset comprises the following columns:
"code", "clientType", "registrationMode", "planName", "accident", "duration", "country", "netSales", "netProfit", "gender", "age"
Accident is the column to be predicted.
The test set is available in the file test_data.csv.
Constraints
Output should match the number of test cases available in test_data.csv
Output Format
The output prediction should be the probability of class "1" (a value between 0 and 1), newline-delimited, one value for each row in the test data set.
No column names should be included in the output.
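A minimal, hedged sketch of producing such an output file (the training file name, model choice, and preprocessing are illustrative, not prescribed by the task):
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train_data.csv")   # assumed training file name
test = pd.read_csv("test_data.csv")

features = ["clientType", "registrationMode", "planName", "duration", "country",
            "netSales", "netProfit", "gender", "age"]
X_train = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features]).reindex(columns=X_train.columns, fill_value=0)

model = LogisticRegression(max_iter=1000).fit(X_train, train["accident"])
probs = model.predict_proba(X_test)[:, 1]

# One probability per test row, newline-delimited, no header or column names.
pd.Series(probs).to_csv("submission.txt", index=False, header=False)
```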
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The file Corpora.txt keeps the corpus used to train the model and the different instances of the classifier. It is basically a text file with one sentence per line, derived from the original corpus file test.tsv available at https://github.com/google-research-datasets/wiki-split.git. We eliminated punctuation marks and special characters from the original file and put each sentence on its own line.
Enju_Output.txt holds the outputs generated by Enju in -so mode (output in stand-off format) using Corpora.txt as input. This file basically contains a per-sentence parse of the natural-language English input, produced with a wide-coverage probabilistic HPSG grammar.
The file Supervision.txt keeps the grammatical tags of the corpus. It holds one tag per word, with each tag on its own line. Sentences are separated by one empty line, while the tags for words in the same sentence are on adjacent lines.
The file Word_Category.txt carries the coarse-grained word category information needed by the model and introduced into it by apical dendrites. Each word in the corpus has a word-category tag which provides additional constraints to those provided by lateral dendrites. This file also contains one tag per word, with each tag on its own line. Sentences are separated by one empty line, while the tags for words in the same sentence are on adjacent lines.
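A small sketch of reading either tag file (Supervision.txt or Word_Category.txt) into per-sentence tag sequences, with empty lines closing a sentence:
```python
def read_tag_file(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tag = line.strip()
            if not tag:              # empty line closes the current sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                current.append(tag)
    if current:
        sentences.append(current)
    return sentences

tags = read_tag_file("Supervision.txt")
```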
The file SynSemTests.xlsx keeps all the grammar classification results as well as the statistical analysis in the classification tests.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition with the ESP32

This repository contains the WiFi CSI human presence detection and activity recognition datasets proposed in [1].

Datasets
DP_LOS - Line-of-sight (LOS) presence detection dataset, comprised of 392 CSI amplitude spectrograms.
DP_NLOS - Non-line-of-sight (NLOS) presence detection dataset, comprised of 384 CSI amplitude spectrograms.
DA_LOS - LOS activity recognition dataset, comprised of 392 CSI amplitude spectrograms.
DA_NLOS - NLOS activity recognition dataset, comprised of 384 CSI amplitude spectrograms.

Table 1: Characteristics of presence detection and activity recognition datasets.

| Dataset | Scenario | #Rooms | #Persons | #Classes | Packet Sending Rate | Interval | #Spectrograms |
|---------|----------|--------|----------|----------|---------------------|----------|---------------|
| DP_LOS  | LOS      | 1      | 1        | 6        | 100Hz               | 4s (400 packets) | 392 |
| DP_NLOS | NLOS     | 5      | 1        | 6        | 100Hz               | 4s (400 packets) | 384 |
| DA_LOS  | LOS      | 1      | 1        | 3        | 100Hz               | 4s (400 packets) | 392 |
| DA_NLOS | NLOS     | 5      | 1        | 3        | 100Hz               | 4s (400 packets) | 384 |

Data Format
Each dataset employs an 8:1:1 training-validation-test split, defined in the provided label files trainLabels.csv, validationLabels.csv, and testLabels.csv. Label files use the sample format [i c], with i corresponding to the spectrogram index (i.png) and c corresponding to the class. For presence detection datasets (DP_LOS, DP_NLOS), c in {0 = "no presence", 1 = "presence in room 1", ..., 5 = "presence in room 5"}. For activity recognition datasets (DA_LOS, DA_NLOS), c in {0 = "no activity", 1 = "walking", 2 = "walking + arm-waving"}. Furthermore, the mean and standard deviation of a given dataset are provided in meanStd.csv.

Download and Use
This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

[1] Strohmayer, Julian, and Martin Kampel. "WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition with the ESP32." International Conference on Computer Vision Systems. Cham: Springer Nature Switzerland, 2023.

BibTeX citation:
@inproceedings{strohmayer2023wifi,
  title={WiFi CSI-Based Long-Range Through-Wall Human Activity Recognition with the ESP32},
  author={Strohmayer, Julian and Kampel, Martin},
  booktitle={International Conference on Computer Vision Systems},
  pages={41--50},
  year={2023},
  organization={Springer}
}
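A minimal sketch of loading spectrograms and labels for one of the datasets (the directory layout and the comma delimiter of the label files are assumptions):
```python
import csv
import numpy as np
from PIL import Image

samples = []
with open("DA_LOS/trainLabels.csv") as f:
    for i, c in csv.reader(f):                       # each row: spectrogram index, class
        img = np.asarray(Image.open(f"DA_LOS/{i}.png"))
        samples.append((img, int(c)))
```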
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset for the competition ABC2018. The competition website is: https://competitions.codalab.org/competitions/16283

File
A zip file contains all the training and test trajectories, and a ground truth label file for the training set. When you unzip the dataset, you find
./test/***.csv
./train/***.csv
./train_labels.csv
where *** is the trajectory number (000, 001, ..., 630 for train; 000, 001, ..., 274 for test).

Task
Classifying GPS trajectories of birds into male or female.

Trajectory file format
A single CSV file (000.csv, 001.csv, ...) contains the trajectory of a trip, and each line represents the information of a GPS location of a shearwater. In addition to longitude and latitude, some other information is provided: elapsed time and local clock time, solar azimuth and elevation angles.
- float: longitude
- float: latitude
- float: sun azimuth [degree] clockwise from the North
- float: sun elevation [degree] upward from the horizon
- int: (1) daytime (between sunrise and sunset), or (0) nighttime
- int: elapsed time [second] after starting the trip
- clock: local time (hh:mm:ss)
- int: days (starts from 0, and increments by 1 when the local time passes 23:59:59)
Float values are of the format %.5f, and fields are separated by a single comma.
Here is an example:
=================
139.29220,38.56632,76.42170,-4.45122,0,0,04:54:03,0
139.29300,38.56763,76.58196,-4.25726,0,60,04:55:03,0
139.29400,38.57053,76.73674,-4.06880,0,118,04:56:01,0
139.29620,38.57563,76.89729,-3.87201,0,178,04:57:01,0
...
=================
- Different trajectories have different numbers of GPS locations.
- The time interval between two successive GPS locations is approximately one minute (60 seconds) when GPS works well; otherwise the interval may vary from one to several minutes, even hours and days.
- Trajectories in the training and test sets are in the same format.
- Ground truth labels for the training set are given in a separate file.

Labels: gender, male/female
A single txt file of ground truth labels for the training set is provided. Each line has the label of the corresponding training trajectory; that is, line 0 is the label of the training trajectory file 000.csv.
The label is binary (character):
- male: 0
- female: 1
Here is an example:
=================
1
1
1
0
1
0
0
...
=================

Stats: Numbers of the dataset
Training set
- 326 male trajectories
- 305 female trajectories
- 631 in total
Test set
- 275 trajectories

Disclaimer
The procedures used in the field study for collecting the data were approved by the Animal Experimental Committee of Nagoya University.

License of the dataset
The dataset was collected by scientific teams for scientific purposes. If you use the dataset for any scientific purposes except this competition, please refer to the following paper:
Sakiko MATSUMOTO, Takashi YAMAMOTO, Maki YAMAMOTO, Carlos B ZAVALAGA and Ken YODA (2017) Sex-related differences in the foraging movement of Streaked Shearwaters Calonectris leucomelas breeding on Awashima Island in the Sea of Japan. Ornithological Science 16(1):23-32. doi: http://dx.doi.org/10.2326/osj.16.23
Please contact the corresponding researcher, Ken Yoda (http://yoda-ken.sakura.ne.jp/yoda_lab/English.html), if you would like to use the dataset for any other purposes, or to access the un-preprocessed original raw data.
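A small sketch of loading one training trajectory and the label file with pandas (the column names are my own labels for the fields listed above; the CSV files have no header row):
```python
import pandas as pd

cols = ["longitude", "latitude", "sun_azimuth", "sun_elevation",
        "daytime", "elapsed_s", "local_time", "days"]
traj = pd.read_csv("train/000.csv", header=None, names=cols)

# One 0/1 label per line; line k labels the k-th training trajectory (zero-padded in the file name).
labels = pd.read_csv("train_labels.csv", header=None, names=["female"])
```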
The grounding lines for the entire Antarctic coastline for available Sentinel1-a/b tracks in 2018 are provided as Shapefiles for the 6-day and 12-day tracks separately, as "AllTracks_6d_GL.shp" and "AllTracks_12d_GL.shp" respectively. The corresponding uncertainty estimates are also provided, as described in the manuscript, which are labelled as "AllTracks_6d_uncertainty.shp" and "AllTracks_12d_uncertainty.shp".
Each grounding line in the Shapefile contains 6 attributes:
ID: grounding line ID for each DInSAR scene
Type: whether the line was used as training or testing data.
Class: whether each identified line is a grounding line or a pinning point
Length: length of the enclosing polygon determining the uncertainty
Width: width of the enclosing polygon determining the uncertainty
FILENAME: name of the original shapefile for the grounding line (before all files were combined into one), which gives all relevant information of the DInSAR data, in the fo...
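A hedged sketch of reading the 6-day grounding lines with geopandas and splitting them by the Type attribute (the exact attribute values used for training/testing are assumptions):
```python
import geopandas as gpd

gl = gpd.read_file("AllTracks_6d_GL.shp")
print(gl[["ID", "Type", "Class", "Length", "Width"]].head())

# Assumed values; check the actual strings stored in the Type field.
train_lines = gl[gl["Type"].str.lower().str.contains("train")]
test_lines = gl[gl["Type"].str.lower().str.contains("test")]
```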
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
BASIC INFORMATION
--------------------
Czech Text Document Corpus v 2.0 is a collection of text documents for automatic document classification in the Czech language. It is composed of text documents provided by the Czech News Agency and is freely available for research purposes. This corpus was created to facilitate a straightforward comparison of document classification approaches on Czech data. It is particularly dedicated to the evaluation of multi-label document classification approaches, because one document is usually labelled with more than one label. Besides the information about the document classes, the corpus is also annotated at the morphological layer.
The main part (for training and testing) is composed of 11,955 real newspaper articles. We also provide a development set, intended for tuning the hyper-parameters of the created models. This set contains 2,735 additional articles.
The total number of categories is 60, of which the 37 most frequent are used for classification. The reason for this reduction is to keep only the classes with a sufficient number of occurrences to train the models.
Technical Details
------------------------
Text documents are stored in individual text files using UTF-8 encoding. Each filename is composed of a serial number and the list of category abbreviations, separated by underscores, followed by the .txt suffix. Serial numbers have five digits, and the numbering starts from one.
For instance, the file 00046_kul_nab_mag.txt represents document number 46, annotated with the categories kul (culture), nab (religion) and mag (magazine selection). The content of the document, i.e. the word tokens, is stored on one line. The tokens are separated by space symbols.
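A small sketch of recovering the serial number and category abbreviations from such file names (the corpus directory name is a placeholder):
```python
from pathlib import Path

def parse_name(path):
    # e.g. 00046_kul_nab_mag.txt -> (46, ["kul", "nab", "mag"])
    serial, *categories = Path(path).stem.split("_")
    return int(serial), categories

for txt_file in sorted(Path("czech_text_corpus").glob("*.txt")):
    number, categories = parse_name(txt_file)
```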
Every text document was further automatically morphologically analyzed. This analysis includes lemmatization, POS tagging and syntactic parsing. The fully annotated files are stored in .conll files. We also provide the lemmatized form (files with the .lemma suffix) and the corresponding POS tags (.pos files). The tokenized version of the documents is also available in .tok files.
This corpus is available only for research purposes for free. Commercial use in any form is strictly excluded.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the Post-Evaluation data for SemEval-2021 Task 12: Learning with Disagreement, a shared task on learning to classify with datasets containing disagreements.
The aim of this shared task is to provide a unified testing framework for learning from disagreements using the best-known datasets containing information about disagreements for interpreting language and classifying images:
1. LabelMe-IC: Image Classification using a subset of LabelMe images (Russell et al., 2008), is a widely used, community-created image classification dataset where images are assigned to one of 8 categories: highway, inside city, tall building, street, forest, coast, mountain, open country. Rodrigues and Pereira (2017) collected crowd labels for these images using Amazon Mechanical Turk (AMT).
2. CIFAR10-IC: Image Classification using a subset of CIFAR-10 dataset, https://www.cs.toronto.edu/~kriz/cifar.html. The entire dataset consists of colour images in 10 categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). Crowdsourced labels for this dataset were collected by Peterson et al (2019).
3. PDIS: Information Status Classification using the Phrase Detectives dataset. Information Status (IS) classification in Phrase Detectives (Poesio et al., 2019) involves identifying the information status of a noun phrase: whether that noun phrase refers to new information or to old information.
4. Gimpel-POS: Part-of-Speech tagging using the Gimpel dataset (Gimpel et al., 2011) for Twitter posts. Plank et al. (2014b) mapped the Gimpel tags to the universal tag set (Petrov et al., 2011), using these tags as gold, and collected crowdsourced labels.
5. Humour: ranking one-line texts using pairwise funniness judgements (Simpson et al., 2019). Crowdworkers annotated pairs of puns to indicate which is funnier. A gold standard ranking was produced using a large number of redundant annotations. The goal is to infer the gold standard ranking from a reduced number of crowdsourced judgements.
The files contained in this data collection are as follows:
starting_kit.zip - Base models provided for the shared task.
practice_phase_data.zip - The training and development data used during the Practice Phase of the competition.
test_phase_data.zip - The test data used during the Evaluation Phase of the competition.
Details of format of each dataset for each task can be found on Codalab.
Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
This repository contains the data presented in the paper "On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems" in ACL 2016. Two separate datasets, as described in section 4 of the paper, are presented:
1. DialogueEmbedding/ contains the [train|valid|test] data for the unsupervised dialogue embedding creation, each with *.[feature|reward|turn|subjsuc]. Note that *.turn includes the lines to be read for each dialogue in *.[feature|reward|subjsuc], and *.subjsuc is the user's subjective rating. The feature size is 74.
2. DialoguePolicy/ includes four contrasting systems with different reward models: [GP|RNN|ObjSubj|Subj]. Inside each system directory is the data obtained in interaction with Amazon Mechanical Turk users while training three policies with the same config: policy_[1|2|3], together with a .csv file of the evaluation results along the training process. In each policy_[1|2|3]/ there is a list of calls, each with a time stamp in its name, containing a session.xml file for the dialogue log and a feedback.xml file for the user feedback.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset and pre-trained models are released as a companion to our OOPSLA '20 publication, "Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs".
For the code of the "Nero" prototype, and more information about the above artifacts, see our GitHub repo.