CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains >800K CSV files behind the GitTables 1M corpus.
For more information about the GitTables corpus, visit:
- our website for GitTables, or
https://research.csiro.au/dap/licences/csiro-data-licence/
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
[doc] formats - csv - 1
This dataset contains one csv file at the root:
data.csv
kind,sound
dog,woof
cat,meow
pokemon,pika
human,hello
size_categories:
[doc] formats - csv - 2
This dataset contains one csv file at the root:
data.csv
kind,sound
dog,woof
cat,meow
pokemon,pika
human,hello
configs:
- config_name: default
  data_files: "*.csv"
  sep: ","
size_categories:
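The `configs` entry above names a file pattern (`data_files`) and a separator (`sep`). As a rough sketch of what those two settings mean, the snippet below reads every matching file with Python's standard library. The helper names `load_rows` and `demo` are hypothetical, and this is not how any particular dataset loader is implemented.

```python
import csv
import glob
import os
import tempfile


def load_rows(root, pattern="*.csv", sep=","):
    """Read every file under `root` matching `pattern` into a list of dict rows."""
    rows = []
    for path in sorted(glob.glob(os.path.join(root, pattern))):
        with open(path, newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle, delimiter=sep))
    return rows


def demo():
    """Recreate the sample data.csv from the card in a temporary folder and load it."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "data.csv")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            handle.write("kind,sound\ndog,woof\ncat,meow\npokemon,pika\nhuman,hello\n")
        return load_rows(root)


rows = demo()
```

Each row comes back as a mapping from the header names (`kind`, `sound`) to the values on that line.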
https://creativecommons.org/publicdomain/zero/1.0/
A sample rainfall dataset containing 100 records (in CSV file format), which includes some missing values, has been created to practice fundamental data cleaning operations and to extract basic statistical information from the provided CSV file.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn't include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini.
Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset, the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn't known, an approximate location is provided.
For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.
This dataset was created by Naman Kumar
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Hilman Jihadi
Released under MIT
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a description of how the datasets for a training notebook used in a Telegram ML Contest solution were prepared.
The first part of the code samples was taken from a private version of this notebook.
Below are the statistics for the programming-language classes in the Github Code Snippets database:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F2fdc091661198e80559f8cb1d1a306ff%2FScreenshot%202023-11-07%20at%2021.24.42.png?generation=1699390166413391&alt=media
From this database, 2 CSV files were created with 50000 code samples for each of the 20 programming languages included, one sampled in equal numbers and one with stratified sampling. The files are sample_equal_prop_50000.csv and sample_stratified_50000.csv, respectively.
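The difference between the two sampling schemes can be sketched as follows. This is an illustration only, not the notebook's code: the function names are hypothetical, and rows are assumed to be dictionaries with a `language` key.

```python
import random
from collections import defaultdict


def group_by_language(rows):
    """Bucket rows by their `language` value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["language"]].append(row)
    return groups


def sample_equal(rows, per_class, seed=0):
    """Draw the same number of rows from every language (capped by availability)."""
    rng = random.Random(seed)
    sample = []
    for _, group in sorted(group_by_language(rows).items()):
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample


def sample_stratified(rows, total, seed=0):
    """Draw about `total` rows while keeping each language's share of the data."""
    rng = random.Random(seed)
    sample = []
    for _, group in sorted(group_by_language(rows).items()):
        quota = round(total * len(group) / len(rows))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample


# Toy corpus: 80 Python snippets and 20 Go snippets.
corpus = [{"language": "Python", "snippet": f"p{i}"} for i in range(80)]
corpus += [{"language": "Go", "snippet": f"g{i}"} for i in range(20)]

equal = sample_equal(corpus, per_class=10)
stratified = sample_stratified(corpus, total=10)
```

With the toy corpus, equal sampling returns ten rows per language, while stratified sampling keeps the 80/20 split of the source data.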
A second option for collecting additional examples was to run this notebook with a larger number of queries, 10000.
The resulting file is dataset-10000.csv, included in the data card.
The statistics for the programming languages are shown in the next chart; it has 32 labeled classes:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F7c04342da8ec1df266cd90daf00204f9%2FScreenshot%202023-10-13%20at%2020.52.13.png?generation=1699392769199533&alt=media
To make the model more robust, code samples for 20 additional languages were collected, 10 to 15 samples each, covering more or less popular use cases. Also, for the class "OTHER" (regular natural-language text, as required by the competition task), text examples from this dataset with prompts on Huggingface were added to the file. The resulting file is rare_languages.csv, also in the data card.
The statistics for the rare-language code snippets are as follows:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F0b340781c774d2acb988ce1567f4afa3%2FScreenshot%202023-11-08%20at%2001.13.07.png?generation=1699402436798661&alt=media
For this stage of dataset creation, the columns in sample_equal_prop_50000.csv and sample_stratified_50000.csv were reduced to just 2: "snippet" and "language". The version of the file with equal numbers is in the data card as sample_equal_prop_50000_clean.csv.
To prepare the BigQuery dataset file, the index column was dropped and the column "content" was renamed to "snippet". These changes were saved in dataset-10000-clean.csv.
After that, the files sample_equal_prop_50000_clean.csv and dataset-10000-clean.csv were combined and saved as github-combined-file.csv.
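The column trimming, renaming, and combining steps can be sketched with the standard library as below. The toy inputs, the extra `repo` column, and the helper names are assumptions for illustration; only the column names "snippet", "language", and "content" come from the description.

```python
import csv
import io


def to_snippet_rows(csv_text, text_column="snippet"):
    """Keep only the code text and its language, under the names snippet/language."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"snippet": row[text_column], "language": row["language"]} for row in reader]


def write_csv(rows):
    """Serialize rows back to CSV text with a snippet,language header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["snippet", "language"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy stand-ins for the two inputs; the real files are far larger.
sampled = "snippet,language,repo\nprint(1),Python,demo/a\n"
bigquery = ",content,language\n0,fmt.Println(1),Go\n"  # unnamed leading index column

combined = to_snippet_rows(sampled) + to_snippet_rows(bigquery, text_column="content")
combined_csv = write_csv(combined)
```

Selecting the two wanted columns by name drops the index column and any other extras in the same pass.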
The prepared files took too much RAM to be read by the Pandas library, so additional preprocessing was done: symbols such as quotes, commas, ampersands, newlines and tab characters were cleaned out. After cleaning, the files were merged with the rare_languages.csv file and saved as github-combined-file-no-symbols-rare-clean.csv and sample_equal_prop_50000_-no-symbols-rare-clean.csv, respectively.
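A minimal sketch of such a cleaning pass is shown below. The exact character set and whether the symbols were deleted or replaced by spaces are not stated, so both are assumptions here, as is the function name `strip_symbols`.

```python
import re

# Assumed character set: quotes, commas, ampersands, tabs, and line breaks.
_SYMBOLS = re.compile(r"[\"',&\t\r\n]+")


def strip_symbols(snippet):
    """Replace each run of unwanted symbols with one space, then tidy spacing."""
    cleaned = _SYMBOLS.sub(" ", snippet)
    return re.sub(r" {2,}", " ", cleaned).strip()


example = strip_symbols('print("a, b")\n\tx & y')
```

After this pass every snippet fits on one line with no embedded delimiters, which keeps the resulting CSV simple to parse.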
The final class distribution is as follows:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2Ff43e0cea4c565c9f7c808527b0dfa2da%2FScreenshot%202023-11-09%20at%2020.26.30.png?generation=1699558064765454&alt=media
To suit the TF-DF format, each programming language was also assigned a label. The final labels are in the data card.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of "Film Circulation on the International Film Festival Network and the Impact on Global Film Culture"
A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file "1_codebook_film-dataset_festival-program") offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file "1_film-dataset_festival-program_long" comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file "1_film-dataset_festival-program_wide" consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list "Berlinale". This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
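The relationship between the long and wide tables can be sketched as follows: the wide table keeps one row per film ID and records the first sample festival. The column names `id` and `date` are hypothetical (only `fest` is named in the description), and this is not the authors' script.

```python
def long_to_wide(long_rows):
    """Collapse long-format rows to one row per film, keeping the earliest festival."""
    wide = {}
    # ISO-formatted dates sort chronologically as plain strings.
    for row in sorted(long_rows, key=lambda r: r["date"]):
        wide.setdefault(row["id"], row)
    return list(wide.values())


long_rows = [
    {"id": 1, "fest": "Frameline", "date": "2018-06-14"},
    {"id": 1, "fest": "Berlinale", "date": "2018-02-15"},
    {"id": 2, "fest": "Frameline", "date": "2018-06-15"},
]

wide_rows = long_to_wide(long_rows)
first_fest = {row["id"]: row["fest"] for row in wide_rows}
```

In the toy data, film 1 appears at two festivals in the long table but is attributed to Berlinale in the wide table because that screening came first.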
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook "2_codebook_survey-dataset" includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file "2_survey-dataset_long-festivals_shared-consent" consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file "2_survey-dataset_wide-no-festivals_shared-consent" consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook "3_codebook_imdb-dataset" includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file "3_imdb-dataset_aka-titles_long" contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file "3_imdb-dataset_awards_long" contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file "3_imdb-dataset_companies_long" contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file "3_imdb-dataset_crew_long" contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file "3_imdb-dataset_festival-runs_long" contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file "3_imdb-dataset_general-info_wide" contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file "3_imdb-dataset_release-info_long" contains data about non-festival release (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file "3_imdb-dataset_websites_long" contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.
The R script "r_1_unite_data" demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script "r_2_scrape_matches" reads in the dataset with the film characteristics described in "r_1_unite_data" and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, "cosine" and "osa": cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
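To make the two title-matching measures concrete, here is a small Python sketch of OSA (optimal string alignment) distance and cosine similarity over character q-grams. The original scripts are written in R; this is not their code, the q-gram size of 2 is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def osa_distance(a, b):
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # A swap of two neighbouring characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def qgrams(text, q=2):
    """Count the overlapping character q-grams of a string."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def cosine_similarity(a, b, q=2):
    """Cosine of the angle between the q-gram count vectors of two strings."""
    pa, pb = qgrams(a, q), qgrams(b, q)
    dot = sum(count * pb[gram] for gram, count in pa.items())
    norm = math.sqrt(sum(v * v for v in pa.values())) * math.sqrt(sum(v * v for v in pb.values()))
    return dot / norm if norm else 0.0


def year_is_compatible(core_year, imdb_year):
    """Production years may differ by at most one, as in the matching rule above."""
    return abs(core_year - imdb_year) <= 1
```

A typo such as swapping two adjacent letters costs a single OSA edit, while cosine similarity stays high for titles that share most of their character pairs.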
The script "r_3_matching" creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script "r_4_scraping_functions" creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script "r_5a_extracting_info_sample" uses the function defined in "r_4_scraping_functions" to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check whether everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script "r_5b_extracting_info_all" extracts the data for the entire dataset of the identified matches.
The script "r_5c_extracting_info_skipped" checks the films with missing data (where data was not scraped) and tries to extract the data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script "r_check_logs" is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file "4_codebook_festival-library_dataset") offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
http://researchdatafinder.qut.edu.au/display/n9373
QUT Research Data Repository Dataset Resource available for download
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as a current directory.
- copy `settings.py` into the extracted folder. Edit the file:
    * set `DATASET_PATH` to some newly created folder path
    * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
  Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` .
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.
Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
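The two protocols can be illustrated with a short sketch. Normalizing independent exponential draws yields a distribution that is uniform over all label distributions, which is what APP asks for. The smoothness criterion used here (sum of squared second differences) is an assumption made for illustration, and the function names are hypothetical; the authors' implementation is linked under Further Reading.

```python
import random


def sample_prevalences(n_classes, rng):
    """Draw a label distribution uniformly at random from the probability simplex."""
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def roughness(p):
    """Sum of squared second differences; small values mean a smooth histogram."""
    return sum((p[i - 1] - 2 * p[i] + p[i + 1]) ** 2 for i in range(1, len(p) - 1))


def app_oq(samples, keep=0.2):
    """Keep the smoothest share of APP samples (assumed criterion: low roughness)."""
    ranked = sorted(samples, key=roughness)
    return ranked[:max(1, int(len(ranked) * keep))]


rng = random.Random(0)
app_samples = [sample_prevalences(5, rng) for _ in range(100)]
oq_samples = app_oq(app_samples)
```

Each entry of `app_samples` is one target label distribution. Drawing data items to match it would give one evaluation sample, which is what each row of the index files encodes.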
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ransomware has been considered a significant threat for most enterprises for the past few years. In scenarios wherein users can access all files on a shared server, one infected host is capable of locking the access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of "not infected" traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.
This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separate folder.
The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.
Each folder (for example N10S10/) contains:
- tree.py -> Python script with the Tree model.
- ensemble.json -> JSON file with the information about the Ensemble model.
- NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3).
- N10S10.csv -> All samples used for training each model in this folder. It is in CSV format for use in the BigML application.
- zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in CSV format for use in the BigML application.
- userSamples_test -> All samples used for validating each model in this folder. It is in CSV format for use in the BigML application.
- userSamples_train -> User samples used for training the models.
- ransomware_train -> Ransomware samples used for training the models.
- scaler.scaler -> Standard scaler from a Python library, used to scale the samples.
- zeroDays_notFiltered -> Folder with the zero-day samples.
In the case of the N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article, because some of them are not "unseen" binaries (their families are present in the training set).
The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows:
- Each line is one sample.
- Each sample has 3*T features and the label (1 if it is an 'infected' sample and 0 if it is not).
- The features are separated by ',' because it is a CSV file.
- The last column is the label of the sample.
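The file layout described above can be checked with a short sketch like the following. It assumes that T equals the number of 1-second intervals x in the folder name NxSy, that the features are numeric, and that the files have no header row; none of this is confirmed by the description, and the function names are hypothetical.

```python
import re


def parse_folder_name(name):
    """Split a folder name such as 'N10S10' into (intervals per sample, sliding step)."""
    match = re.fullmatch(r"N(\d+)S(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected folder name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def parse_sample(line, t):
    """Split one CSV line into its 3*T features and the trailing 0/1 label."""
    values = line.strip().split(",")
    if len(values) != 3 * t + 1:
        raise ValueError(f"expected {3 * t + 1} columns, got {len(values)}")
    features = [float(v) for v in values[:-1]]
    label = int(float(values[-1]))  # 1 = 'infected', 0 = 'not infected'
    return features, label


intervals, step = parse_folder_name("N10S10")
features, label = parse_sample("1,2,3,4,5,6,1", t=2)
```

Checking the column count against 3*T + 1 catches files read with the wrong folder configuration before any model sees the data.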
Additionally, we have placed two pcap files in the root directory. These are the traces used to compare both versions of SMB.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified in 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename and the activity label.
Activities:
Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs and remote CLI sessions over SSH.
Bulk data transfer: applications that transfer large files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers like Mediafire, Dropbox or the university repository, among others.
Web browsing: contains all the traffic generated while searching and consuming different web pages. Examples of those pages are several blogs and news sites and the Moodle of the university.
Video playback: contains traffic from applications that consume video in streaming or pseudo-streaming. The best-known servers used are Twitch and YouTube, but the university online classroom has also been used.
Idle behaviour: is composed of the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some pages open, such as Google Docs, YouTube and several web pages, but always without user interaction.
The capture is performed in a network probe, attached to the router that forwards the user network traffic, using a SPAN port. The traffic is stored in pcap format with all the packet payload. In the csv file, every non TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): Timestamp, protocol, payload size, IP address source and destination, UDP/TCP port source and destination. The fields are also included as a header in every csv file.
The amount of data is stated as follows:
Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
Video: 23 traces, 4496 s, 1405 MBytes
Web: 23 traces, 4203 s, 148 MBytes
Interactive: 42 traces, 8934 s, 30.5 MBytes
Idle: 52 traces, 6341 s, 0.69 MBytes
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A diverse selection of 1000 empirical time series, along with results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.
The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat, for use in Matlab using v1.06 of hctsa.
The same data is also provided in .csv format: hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and the corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv), and the data of individual time series (each line a time series, for the time series described in hctsa_timeseries-info.csv) in hctsa_timeseries-data.csv. These .csv files were produced by running >> OutputToCSV('HCTSA_Empirical1000.mat',true,true); in hctsa.
The input file, INP_Empirical1000.mat, is for use with hctsa, and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as >> TS_Init('INP_Empirical1000.mat');
Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be performed by the user using TS_PlotTimeSeries from the hctsa package.
See links in references for more comprehensive documentation for performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.
[doc] manual configuration 3
This dataset contains two csv files in the data/ directory and one csv file in the holdout/ directory, and a YAML field configs that specifies the data files and splits, using glob expressions.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Walmart data in CSV format, extracted by the Crawl Feeds team using in-house tools. Last extracted on 15 Aug 2022.
Product Lists
Walmart dataset,retail datasets,ecommerce datasets
10
Free
This is a sample of the CSV files that can be downloaded from Microsoft Teams after meetings. Because MS Teams has recently been used to deliver classes in schools and universities, tracking student attendance became important. This dataset can be used to write code that analyses student attendance.