Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By SocialGrep [source]
A subreddit dataset is a collection of posts and comments made on Reddit's /r/datasets board. This dataset contains all the posts and comments made on the /r/datasets subreddit from its inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames to preserve users' anonymity and to prevent targeted harassment
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
In order to use this dataset, you will need to have a text editor such as Microsoft Word or LibreOffice installed on your computer. You will also need a web browser such as Google Chrome or Mozilla Firefox.
Once you have the necessary software installed, open the The Reddit Dataset folder and double-click on the the-reddit-dataset-dataset-posts.csv file to open it in your preferred text editor.
In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.
You can use this information to analyze trends in data sets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subReddits. Additionally, sentiment analysis could be performed on the titles of posts to see if there is a correlation between positive/negative sentiment and upvotes/downvotes
- Finding correlations between different types of datasets
- Determining which datasets are most popular on Reddit
- Analyzing the sentiments of post and comments on Reddit's /r/datasets board
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: the-reddit-dataset-dataset-comments.csv | Column name | Description | |:-------------------|:---------------------------------------------------| | type | The type of post. (String) | | subreddit.name | The name of the subreddit. (String) | | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) | | created_utc | The time the post was created, in UTC. (Timestamp) | | permalink | The permalink for the post. (String) | | body | The body of the post. (String) | | sentiment | The sentiment of the post. (String) | | score | The score of the post. (Integer) |
File: the-reddit-dataset-dataset-posts.csv | Column name | Description | |:-------------------|:---------------------------------------------------| | type | The type of post. (String) | | subreddit.name | The name of the subreddit. (String) | | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) | | created_utc | The time the post was created, in UTC. (Timestamp) | | permalink | The permalink for the post. (String) | | score | The score of the post. (Integer) | | domain | The domain of the post. (String) | | url | The URL of the post. (String) | | selftext | The self-text of the post. (String) | | title | The title of the post. (String) |
If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit SocialGrep.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provides a wide range of functions for programming and analyzing of data. Unlike many of the existing statistical softwares, R has the added benefit of allowing the users to write more efficient codes by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allows the users to define their own (customized) functions on how they expect the program to behave while handling the data, which can also be stored in the simple object system.For all intents and purposes, this book serves as both textbook and manual for R statistics particularly in academic research, data analytics, and computer programming targeted to help inform and guide the work of the R users or statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R particularly for research purposes with examples. Ranging from how to import and store datasets in R as Objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, congruence of Statistics and Computer programming for Research.
Facebook
TwitterLoad and view a real-world dataset in RStudio
• Calculate “Measure of Frequency” metrics
• Calculate “Measure of Central Tendency” metrics
• Calculate “Measure of Dispersion” metrics
• Use R’s in-built functions for additional data quality metrics
• Create a custom R function to calculate descriptive statistics on any given dataset
Facebook
TwitterZooplankton biomass data collected from North Atlantic Ocean in 1958 - 1959 years received from NMFS.
Facebook
TwitterTemperature, salinity and other measurements found in datasets XBT and CTD taken from the MIRAI (R/V; call sign JNSR; built 1972 as Mutsu; renamed on 02/02/1996)in the North Pacific, Coastal N Pacific and other locations in 1999 (NODC Accession 0000857). Data were submitted by the Japan Agency for Marine-Earth Science and Technology (JAMSTEC).
Facebook
TwitterAttribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for "ALMA-R-ko-en-Preference"
ref) https://huggingface.co/datasets/haoranxu/ALMA-R-Preference The triplet prference data, supporting 2 translation directions, is built upon the FLORES-200 development and test data. For each direction, we provide a source sentence along with three translations: one from GPT-4, another from EEVE-ALMA-LoRA, and a reference translation. For instance, in the English-German pair, our data structure is as follows:
Sentences:… See the full description on the dataset page: https://huggingface.co/datasets/qwopqwop/ALMA-R-ko-en.
Facebook
TwitterRelease of ships of opportunity XBT data compiled by the Maritime Environment Information Centre at the UKHO. The data represents 5 years (1995 - 2000) of archived SOOP XBT observations. The dataset contains 7214 observations from various vessels (see vessel summary), the data ranges from years 1984 to 2000. Previously to 1995 the data was released annually to the oceanographic community through ICES and US NODC, World Data Centre A.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please note this dataset is the most recent version of the Administrative Boundaries (AB). For previous versions of the AB please go to this url: https://data.gov.au/data/dataset/previous-versions-of-the-geoscape-administrative-boundaries\r \r ----------------------------------\r \r Geoscape Administrative Boundaries is Australia’s most comprehensive national collection of boundaries, including government, statistical and electoral boundaries. It is built and maintained by Geoscape Australia using authoritative government data. Further information about contributors to Administrative Boundaries is available here.\r \r This dataset comprises seven Geoscape products:\r \r * Localities\r * Local Government Areas (LGAs)\r * Wards\r * Australian Bureau of Statistics (ABS) Boundaries\r * Electoral Boundaries\r * State Boundaries and\r * Town Points\r \r Updated versions of Administrative Boundaries are published on a quarterly basis.\r \r Users have the option to download datasets with feature coordinates referencing either GDA94 or GDA2020 datums.\r \r Notable changes in the November 2025 release\r \r * There have been spatial changes (area) greater than 1 km2 to the local government areas ‘Derwent Valley Council’ and ‘Brighton Council’ in Tasmania.\r \r * ‘Kairabah’ locality in ‘Logan City’ local government area has been retired in Queensland.\r \r * There have been spatial changes (area) greater than 1 km2 to the localities ‘Neilrex’, ‘Daruka’ and ‘Merrygoen’ in New South Wales, ‘Yarrabilba’, ‘Petrie’ and ‘Kallangur’ in Queensland, and ‘Flinders Ranges’ and ‘Angorigina’ in South Australia.\r \r \r Further information on Administrative Boundaries, including FAQs on the data, is available here or through Geoscape Australia’s network of partners. They provide a range of commercial products based on Administrative Boundaries, including software solutions, consultancy and support.\r \r \r \r Users must also note the following attribution requirements:\r \r Preferred attribution for the Licensed Material:\r \r
Administrative Boundaries © Geoscape Australia licensed by the Commonwealth of Australia under Creative Commons Attribution 4.0 International license (CC BY 4.0).\r \r Preferred attribution for Adapted Material:\r \r Incorporates or developed using Administrative Boundaries © Geoscape Australia licensed by the Commonwealth of Australia under Creative Commons Attribution 4.0 International licence (CC BY 4.0).\r \r
What to Expect When You Download Administrative Boundaries\r
\r Administrative Boundaries is large dataset (around 1.5GB unpacked), made up of seven themes each containing multiple layers.\r \r Users are advised to read the technical documentation including the product change notices and the individual product descriptions before downloading and using the product.\r \r Please note this dataset is the most recent version of the Administrative Boundaries (AB). For previous versions of the AB please go to this url: https://data.gov.au/dataset/ds-dga-b4ad5702-ea2b-4f04-833c-d0229bfd689e/details?q=previous\r \r
License Information\r
\r The Australian Government has negotiated the release of Administrative Boundaries to the whole economy under an open CCBY 4.0 license.\r \r Users must only use the data in ways that are consistent with the Australian Privacy Principles issued under the Privacy Act 1988 (Cth).\r \r
Facebook
TwitterThis dataset provides code, data, and instructions for replicating the analysis of Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression published in OpenSym 2021 (link to come). The paper introduces a method for transforming scores from the ORES quality models into a single dimensional measure of quality amenable for statistical analysis that is well-calibrated to a dataset. The purpose is to improve the validity of research into article quality through more precise measurement. The code and data for replicating the paper are found in this dataverse repository. If you wish to use method on a new dataset, you should obtain the actively maintaned version of the code from this git repository. If you attempt to replicate part of this repository please let me know via an email to nathante@uw.edu. Replicating the Analysis from the OpenSym Paper This project analyzes a sample of articles with quality labels from the English Wikipedia XML dumps from March 2020. Copies of the dumps are not provided in this dataset. They can be obtained via https://dumps.wikimedia.org/. Everything else you need to replicate the project (other than a sufficiently powerful computer) should be available here. The project is organized into stages. The prerequisite data files are provided at each stage so you do not need to rerun the entire pipeline from the beginning, which is not easily done without a high-performance computer. If you start replicating at an intermediate stage, this should overwrite the inputs to the downstream stages. This should make it easier to verify a partial replication. To help manage the size of the dataverse, all code files are included in code.tar.gz. Extracting this with tar xzvf code.tar.gz is the first step. Getting Set Up You need a version of R >= 4.0 and a version of Python >= 3.7.8. You also need a bash shell, tar, gzip, and make installed as they should be installed on any Unix system. To install brms you need a working C++ compiler. If you run into trouble see the instructions for installing Rstan. The datasets were built on CentOS 7, except for the ORES scoring which was done on Ubuntu 18.04.5 and building which was done on Debian 9. The RemembR and pyRembr projects provide simple tools for saving intermediate variables for building papers with LaTex. First, extract the articlequality.tar.gz, RemembR.tar.gz and pyRembr.tar.gz archives. Then, install the following: Python Packages Running the following steps in a new Python virtual environment is strongly recommended. Run pip3 install -r requirements.txt to install the Python dependencies. Then navigate into the pyRembr directory and run python3 setup.py install. R Packages Run Rscript install_requirements.R to install the necessary R libraries. If you run into trouble installing brms see the instructions on Drawing a Sample of Labeled Articles I provide steps and intermediate data files for replicating the sampling of labeled articles. The steps in this section are quite computationally intensive. Those only interested in replicating the models and analyses should skip this section. Extracting Metadata from Wikipedia Dumps Metadata from the Wikipedia dumps is required for calibrating models to the revision and article levels of analysis. You can use the wikiq Python script from the mediawiki dump tools git repository to extract metadata from the XML dumps as TSV files. The version of wikiq that was used is provided here. Running Wikiq on a full dump of English Wikipedia in a reasonable amount of requires considerable computing resources. For this project, Wikiq was run on Hyak a high performance computer at the University of Washington. The code for doing so is highly speicific to Hyak. For transparency and in case it helps others using similar academic computers this code is included in WikiqRunning.tar.gz. A copy of the wikiq output is included in this dataset in the multi-part archive enwiki202003-wikiq.tar.gz. To extract this archive, download all the parts and then run cat enwiki202003-wikiq.tar.gz* > enwiki202003-wikiq.tar.gz && tar xvzf enwiki202003-wikiq.tar.gz. Obtaining Quality Labels for Articles We obtain up-to-date labels for each article using the articlequality python package included in articlequality.tar.gz. The XML dumps are also the input to this step, and while it does not require a great deal of memory, a powerful computer (we used 28 cores) is helpful so that it completes in a reasonable amount of time. extract_quality_labels.sh runs the command to extract the labels from the xml dumps. The resulting files have the format data/enwiki-20200301-pages-meta-history*.xml-p*.7z_article_labelings.json and are included in this dataset in the archive enwiki202003-article_labelings-json.tar.gz. Taking a Sample of Quality Labels I used Apache Spark to merge the metadata from Wikiq with the quality labels and to draw a sample of articles where each quality class is equally represented. To...
Facebook
TwitterThe simulated community datasets were built using the virtualspecies V1.5.1 R package (Leroy et al., 2016), which generates spatially-explicit presence/absence matrices from habitat suitability maps. We simulated these suitability maps using Gaussian fields neutral landscapes produced using the NLMR V1.0 R package (Sciaini et al., 2018). To allow for some level of overlap between species suitability maps, we divided the γ-diversity (i.e., the total number of simulated species) by an adjustable correlation value to create several species groups that share suitability maps. Using a full factorial design, we developed 81 presence/absence maps varying across four axes (see Supplemental Table 1 and Supplemental Figure 1): 1) landscape size, representing the number of sites in the simulated landscape; 2) γ-diversity; 3) the level of correlation among species suitability maps, with greater correlations resulting in fewer shared species groups among suitability maps; and 4) the habitat suitabil...
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data was sourced from the Bureau of Infrastructure, Transport and Regional Economics.\r For more information see https://bitre.gov.au/publications/2015/files/BITRE_yearbook_2015_full_report.pdf\r \r Dataset used to produce Figure BLT25 in Built environment. See; https://soe.environment.gov.au/theme/built-environment/topic/2016/livability-transport#built-environment-figure-BLT25\r
Facebook
TwitterAttribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Metabolites are referenced in spectral, structural and pathway databases with a diverse array of schemas, including various internal database identifiers and large tables of common name synonyms. Cross-linking metabolite identifiers is a required step for meta-analysis of metabolomic results across studies but made difficult due to the lack of a consensus identifier system. We have implemented metLinkR, an R package that leverages RefMet and RaMP-DB to automate and simplify cross-linking metabolite identifiers across studies and generating common names. MetLinkR accepts as input metabolite common names and identifiers from five different databases (HMDB, KEGG, ChEBI, LIPIDMAPS and PubChem) to exhaustively search for possible overlap in supplied metabolites from input data sets. In an example of 13 metabolomic data sets totaling 10,400 metabolites, metLinkR identified and provided common names for 1377 metabolites in common between at least 2 data sets in less than 18 min and produced standardized names for 74.4% of the input metabolites. In another example comprising five data sets with 3512 metabolites, metLinkR identified 715 metabolites in common between at least two data sets in under 12 min and produced standardized names for 82.3% of the input metabolites. Outputs of MetLInR include output tables and metrics allowing users to readily double check the mappings and to get an overview of chemical classes represented. Overall, MetLinkR provides a streamlined solution for a common task in metabolomic epidemiology and other fields that meta-analyze metabolomic data. The R package, vignette and source code are freely downloadable at https://github.com/ncats/metLinkR.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Queensland Government works closely with all Queenslanders to build safe, caring and connected communities, create jobs, protect our environment and enhance quality of life for present and future generations.\r \r To help achieve this, a number of grant and assistance programs are open to individuals, small community groups and non-government organisations right across the State.\r \r Funding programs cover a range of service areas and interests from sport, arts, and events to education, environment and Advance Queensland research and innovation initiatives.\r \r Expenditure data reflects amounts paid for these grant/funding and frontline service procurement programs. This is a consolidated dataset that can be used by the public, non-government organisations and Government departments to research the investment the Queensland Government makes across grant/funding programs. \r \r The information is correct at the time of reporting however figures and other details may be subject to change. Budgeted and expenditure amounts may also be subject to variation due to changes in project timing, scope or funding reallocations.\r \r Where individuals have received funding, these details have been aggregated within the broader program, funding categories or location, to maintain privacy.\r \r
Please view the Data Dictionary (see Data and resources) to understand the data attributes, quality and limitations of this data which should all be considered before using this data for any purpose.\r \r Due to program name changes, consolidation of funding into other programs, machinery-of-government changes and refinement in the data fields collected overtime, comparison of figures across different years may be ineffective.\r \r
This dataset provides a consolidated view of payments made by all Queensland Government agencies responsible for the administration of grant and funding programs. The data contained within this file is not owned by Queensland Treasury. \r \r To view the data as provided by the data owners, please use the link under Related resources. Any questions regarding the data should be directed to the respective Queensland Government agency responsible for the payment. You can contact the responsible agency via the Contact Us option.\r \r
Facebook
TwitterPartially clonal species are common in the Tree of Life. And yet, population genetics models still mostly focus on the extremes: strictly sexual versus purely asexual reproduction. Here we present an R package built upon GenClone software including new functions and several improvements. The RClone package includes functions to handle clonal datasets, allowing (i) checking for dataset reliability to discriminate multi-locus genotypes (MLG), (ii) ascertainment of MLG and semi-automatic determination of clonal lineages (MLL), (iii) genotypic richness and evenness indices calculation based on MLGs or MLLs, and (iv) describing several spatial components of clonality. RClone allows the one shot analysis of multi-population datasets without size limitation, suitable for datasets now increasingly produced through Next Generation Sequencing. A major improvement compared to existing software is the ability to determine the threshold to cluster similar MLGs into MLLs, based on implemented simulat...
Facebook
TwitterThe goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled Fashion-MNIST dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.
The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:
[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled FashionMNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original FashionMNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72x72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.
The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5
Additionally, for the Rescaled FashionMNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2k/4, with k being integers in the range [-4, 4]:
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5
These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
with h5py.File(`
x_train = np.array( f["/x_train"], dtype=np.float32)
x_val = np.array( f["/x_val"], dtype=np.float32)
x_test = np.array( f["/x_test"], dtype=np.float32)
y_train = np.array( f["/y_train"], dtype=np.int32)
y_val = np.array( f["/y_val"], dtype=np.int32)
y_test = np.array( f["/y_test"], dtype=np.int32)
We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
with h5py.File(`
x_test = np.array( f["/x_test"], dtype=np.float32)
y_test = np.array( f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read(`
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.
Facebook
TwitterThis dataset consists of 300 synthetically generated foggy images created from an original set of 100 real-world vehicle images. Each image has been augmented into three distinct fog conditions: low fog, moderate fog, and high fog, using a custom fog simulation filter built in Python and OpenCV. The goal of this dataset is to support research and development of robust vehicle and object detection models under varying foggy weather scenarios, where visibility is typically reduced and object edges are distorted.
The dataset is organized into three folders:
low_fog/: Images with mild fog overlay.
moderate_fog/: Images with medium-density fog.
high_fog/: Images with dense fog simulation.
This dataset is ideal for training, testing, and benchmarking computer vision models (e.g., YOLO, Faster R-CNN, ResNet) in low-visibility conditions, especially for autonomous vehicles and surveillance systems.
Cite these papers
[1] Kadhiravan, D., Pradeepa, J., & Ragavan, K. (2024). Remote Disease Diagnosis through IoMT-Enhanced Blood Cell Classification with Deep Learning. The Open Biomedical Engineering Journal, 18(1). [2] Kadhiravan, D., Sharmila, A., & Pradeepa, J. INVESTIGATION OF OPTIMIZING EFFICIENCY ON ORIENTATION EFFECTS ON WIRELESS POWER TRANSFER. [3] Darly, S. S., Kadhiravan, D., Hemachandran, K., & Rege, M. (2024). Simulation strategies for analyzing of data. Handbook of Artificial Intelligence and Wearables, 27-64.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data, in which the following subfolders are stored are stored as zipped files:
“code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolder which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk that refers to the extraction of MODIS data from .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from google chrome. Packages: terra. Download MODIS data from after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed, 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif-file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS. The repository provides the already clipped and merged NDVI datasets.
4_TREND_analysis_NDVI.R
Now, we want to perform trend analysis from the derived data. The data we load is tricky as it contains 16-days return period across a year for the period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with high confidence level (0.05). Using the ggplot2 package and the melt function from reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different raster to characterize the built-up change in continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we are using the country productivity from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv". Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection). Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.From the processed data, trend analysis are conducted and z-scores were calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subset can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-July (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).From the data, mean values of 48 consecutive years are calculated and trend analysis are performed as describe above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, ad different spatial extent across the globe due to the availability of the GLDAS variables.
(9_workflow_diagramme) this simple code can be used to plot a workflow diagram and is detached from the actual analysis.
Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration, and Funding acquisition: Michael
Facebook
TwitterData update from CLIVAR AND CARBON HYDROGRAPHIC DATA OFFICE (CCHDO)
Facebook
TwitterAttribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The PRIDE database is the largest public data repository of mass spectrometry-based proteomics data and currently stores more than 40,000 data sets covering a wide range of organisms, experimental techniques, and biological conditions. During the past few years, PRIDE has seen a significant increase in the amount of submitted data-independent acquisition (DIA) proteomics data sets. This provides an excellent opportunity for large-scale data reanalysis and reuse. We have reanalyzed 15 public label-free DIA data sets across various healthy human tissues to provide a state-of-the-art view of the human proteome in baseline conditions (without any perturbations). We computed baseline protein abundances and compared them across various tissues, samples, and data sets. Our second aim was to compare protein abundances obtained here from the results of previous analyses using human baseline data-dependent acquisition (DDA) data sets. We observed a good correlation across some tissues, especially in the liver and colon, but weak correlations were found in others, such as the lung and pancreas. The reanalyzed results including protein abundance values and curated metadata are made available to view and download from the resource Expression Atlas.
Facebook
TwitterDonation sent to the University of Idaho Library Government Documents Librarian a CD containing General Land Office maps on it. A readme file on the CD contains this information:"I obtained the attached GLO maps from Mitch Price at River Design Group who obtained them from another source. These maps apparently do not have a date, I assume it was stripped off when they were rectified. These maps show the Great Northern Rail line, it arrived in Bonners Ferry in 1892. The Spokane International Railroad (Union Pacific purchased this line) built a bridge across the Kootenai R. in 1906." "I am a bit puzzled on the map dates, the Kootenai River Master Plan indicated these maps are 1862-65 but they also show the Great Northern Rail line but not the Spokane International Railroad which seems to place them somewhere between 1892 - 1906 unless perhaps they were revised at a later date."Gary Barton USGS Tacoma, WA 253-552-1613 officegbarton@usgs.gov
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By SocialGrep [source]
A subreddit dataset is a collection of posts and comments made on Reddit's /r/datasets board. This dataset contains all the posts and comments made on the /r/datasets subreddit from its inception to March 1, 2022. The dataset was procured using SocialGrep. The data does not include usernames to preserve users' anonymity and to prevent targeted harassment
For more datasets, click here.
- 🚨 Your notebook can be here! 🚨!
In order to use this dataset, you will need to have a text editor such as Microsoft Word or LibreOffice installed on your computer. You will also need a web browser such as Google Chrome or Mozilla Firefox.
Once you have the necessary software installed, open the The Reddit Dataset folder and double-click on the the-reddit-dataset-dataset-posts.csv file to open it in your preferred text editor.
In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.
You can use this information to analyze trends in data sets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subReddits. Additionally, sentiment analysis could be performed on the titles of posts to see if there is a correlation between positive/negative sentiment and upvotes/downvotes
- Finding correlations between different types of datasets
- Determining which datasets are most popular on Reddit
- Analyzing the sentiments of post and comments on Reddit's /r/datasets board
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: the-reddit-dataset-dataset-comments.csv | Column name | Description | |:-------------------|:---------------------------------------------------| | type | The type of post. (String) | | subreddit.name | The name of the subreddit. (String) | | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) | | created_utc | The time the post was created, in UTC. (Timestamp) | | permalink | The permalink for the post. (String) | | body | The body of the post. (String) | | sentiment | The sentiment of the post. (String) | | score | The score of the post. (Integer) |
File: the-reddit-dataset-dataset-posts.csv | Column name | Description | |:-------------------|:---------------------------------------------------| | type | The type of post. (String) | | subreddit.name | The name of the subreddit. (String) | | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean) | | created_utc | The time the post was created, in UTC. (Timestamp) | | permalink | The permalink for the post. (String) | | score | The score of the post. (Integer) | | domain | The domain of the post. (String) | | url | The URL of the post. (String) | | selftext | The self-text of the post. (String) | | title | The title of the post. (String) |
If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit SocialGrep.