Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.
this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
    download the fixed-width file containing household, family, and person records
    import by separating this file into three tables, then merge 'em together at the person-level
    download the fixed-width file containing the person-level replicate weights
    merge the rectangular person-level file with the replicate weights, then store it in a sql database
    create a new variable - one - in the data table

2012 asec - analysis examples.R
    connect to the sql database created by the 'download all microdata' program
    create the complex sample survey object, using the replicate weights
    perform a boatload of analysis examples

replicate census estimates - 2011.R
    connect to the sql database created by the 'download all microdata' program
    create the complex sample survey object, using the replicate weights
    match the sas output shown in the png file below

2011 asec replicate weight sas output.png
    statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    the census bureau's current population survey page
    the bureau of labor statistics' current population survey page
    the current population survey's wikipedia article

notes:

interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name.

as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.
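To make the "rectangular file" idea and the subtract-a-year rule concrete, here is a minimal Python sketch (not the R scripts from the repository). It uses an in-memory SQLite database; the table names, column names and rows are all hypothetical and are not taken from the actual cps-asec layout.

```python
import sqlite3

def reference_year(file_year: int) -> int:
    """Interviews happen in March about the previous year, so subtract one."""
    return file_year - 1

conn = sqlite3.connect(":memory:")
# Hypothetical miniature household, family and person tables (made-up columns).
conn.executescript("""
    CREATE TABLE household (h_id INTEGER, state TEXT);
    CREATE TABLE family (h_id INTEGER, f_id INTEGER, fam_income INTEGER);
    CREATE TABLE person (h_id INTEGER, f_id INTEGER, p_id INTEGER, age INTEGER);
    INSERT INTO household VALUES (1, 'NY'), (2, 'CA');
    INSERT INTO family VALUES (1, 1, 50000), (2, 1, 72000);
    INSERT INTO person VALUES (1, 1, 1, 34), (1, 1, 2, 36), (2, 1, 1, 51);
""")

# Rectangular file: one row per person, with family- and household-level
# information attached, plus a constant column named "one".
rectangular = conn.execute("""
    SELECT p.h_id, p.f_id, p.p_id, p.age, f.fam_income, h.state, 1 AS one
    FROM person p
    JOIN family f ON f.h_id = p.h_id AND f.f_id = p.f_id
    JOIN household h ON h.h_id = p.h_id
    ORDER BY p.h_id, p.p_id
""").fetchall()

for row in rectangular:
    print(row)
print("the 2012 file describes", reference_year(2012))
```

The constant "one" column mirrors the "create a new variable - one" step in the download script; a column of ones is a common convenience for counting records in weighted totals, though the source does not say how the scripts use it.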
confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ATO (Australian Tax Office) made a dataset openly available (see links) showing all the Australian Salary and Wages (2002, 2006, 2010, 2014) by detailed occupation (around 1,000) and over 100 SA4 regions. Sole Trader sales and earnings are also provided. This open data (csv) is now packaged into a database (*.sql) with 45 sample SQL queries (backupSQL[date]_public.txt). See more description at related Figshare #datavis record.

Versions:
V5: Following #datascience course, I have made main data (individual salary and wages) available as csv and Jupyter Notebook. Checksum matches #dataTotals. In 209,xxx rows. Also provided Jobs, and SA4 (Locations) description files as csv. More details at: Where are jobs growing/shrinking? Figshare DOI: 4056282 (linked below). Noted 1% discrepancy ($6B) in 2010 wages total - to follow up.

#dataTotals - Salary and Wages
Year    Workers (M)    Earnings ($B)
2002    8.5            285
2006    9.4            372
2010    10.2           481
2014    10.3           584

#dataTotal - Sole Traders
Year    Workers (M)    Sales ($B)    Earnings ($B)
2002    0.9            61            13
2006    1.0            88            19
2010    1.1            112           26
2014    1.1            96            30

#links
See ATO request for data at ideascale link below.
See original csv open data set (CC-BY) at data.gov.au link below.
This database was used to create maps of change in regional employment - see Figshare link below (m9.figshare.4056282).

#package
This file package contains a database (analysing the open data) in SQL package and sample SQL text, interrogating the DB. DB name: test. There are 20 queries relating to Salary and Wages.

#analysis
The database was analysed and outputs provided on Nectar(.org.au) resources at: http://118.138.240.130. (offline) This is only resourced for max 1 year, from July 2016, so will expire in June 2017. Hence the filing here. The sample home page is provided here (and pdf), but not all the supporting files, which may be packaged and added later. Until then all files are available at the Nectar URL.
Nectar URL now offline - server files attached as package (html_backup[date].zip), including php scripts, html, csv, jpegs.

#install
IMPORT: DB SQL dump e.g. test_2016-12-20.sql (14.8Mb)
1. Started MAMP on OSX.
1.1 Go to PhpMyAdmin
2. New Database:
3. Import: Choose file: test_2016-12-20.sql -> Go (about 15-20 seconds on MacBookPro 16Gb, 2.3 Ghz i5)
4. four tables appeared: jobTitles 3,208 rows | salaryWages 209,697 rows | soleTrader 97,209 rows | stateNames 9 rows
   plus views e.g. deltahair, Industrycodes, states
5. Run test query under **#; Sum of Salary by SA4 e.g. 101 $4.7B, 102 $6.9B

#sampleSQL
select sa4,
  (select sum(count) from salaryWages where year = '2014' and sa4 = sw.sa4) as thisYr14,
  (select sum(count) from salaryWages where year = '2010' and sa4 = sw.sa4) as thisYr10,
  (select sum(count) from salaryWages where year = '2006' and sa4 = sw.sa4) as thisYr06,
  (select sum(count) from salaryWages where year = '2002' and sa4 = sw.sa4) as thisYr02
from salaryWages sw
group by sa4
order by sa4
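The sample query pivots yearly totals into one row per SA4 region. As a rough illustration, the Python sketch below builds the same shape of result with conditional aggregation (one pass over the table) instead of four correlated subqueries. This is a swapped-in technique, not the query shipped with the package. It runs against an in-memory SQLite table rather than the MySQL dump, reuses only the table and column names visible in the sample query (salaryWages, sa4, year, count), and the rows are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the real salaryWages table; only the three columns used
# by the sample query are modelled, and the values are invented.
conn.execute('CREATE TABLE salaryWages (sa4 TEXT, year TEXT, "count" INTEGER)')
conn.executemany(
    "INSERT INTO salaryWages VALUES (?, ?, ?)",
    [
        ("101", "2002", 10),
        ("101", "2014", 15),
        ("101", "2014", 5),
        ("102", "2010", 7),
        ("102", "2014", 9),
    ],
)

# Conditional aggregation: each SUM only sees rows of one year.
# A region with no rows for a year gets NULL, as with the subquery form.
pivot_sql = """
SELECT sa4,
       SUM(CASE WHEN year = '2014' THEN "count" END) AS thisYr14,
       SUM(CASE WHEN year = '2010' THEN "count" END) AS thisYr10,
       SUM(CASE WHEN year = '2006' THEN "count" END) AS thisYr06,
       SUM(CASE WHEN year = '2002' THEN "count" END) AS thisYr02
FROM salaryWages
GROUP BY sa4
ORDER BY sa4
"""
rows = conn.execute(pivot_sql).fetchall()
for row in rows:
    print(row)
```

On a table of about 200,000 rows the single-pass form may be noticeably faster than one subquery per year and region, but that has not been measured against the published database.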
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the paper:
"Beyond Text-to-SQL for IoT Defense: A Comprehensive Framework for Querying and Classifying IoT Threats"
Published in TrustNLP: Fifth Workshop on Trustworthy Natural Language Processing, colocated with NAACL 2025.
This dataset is designed to facilitate research in:
The dataset consists of three main components:
iot_database.sql.gz
text-to-SQL-data.zip
network_traffic_data.zip
gunzip iot_database.sql.gz
mysql -u
SHOW TABLES;
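Before loading the dump into MySQL, it can help to check which tables a gzipped SQL file defines. The following Python sketch does that by scanning for CREATE TABLE statements. It is an illustration only: the function name list_tables is hypothetical, the demo file and its table names are invented, and nothing here reflects the actual contents of iot_database.sql.gz.

```python
import gzip
import os
import re
import tempfile

# Matches "CREATE TABLE name" with optional IF NOT EXISTS and backticks.
CREATE_RE = re.compile(
    r"CREATE TABLE\s+(?:IF NOT EXISTS\s+)?`?(\w+)`?", re.IGNORECASE
)

def list_tables(path):
    """Return table names found in CREATE TABLE lines of a gzipped SQL dump."""
    tables = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = CREATE_RE.search(line)
            if match:
                tables.append(match.group(1))
    return tables

# Demo on a tiny, made-up dump so the sketch runs without the real file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.sql.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
        fh.write("CREATE TABLE `devices` (id INT);\n")
        fh.write("INSERT INTO devices VALUES (1);\n")
        fh.write("CREATE TABLE flows (id INT);\n")
    demo_tables = list_tables(demo_path)

print(demo_tables)
```

Reading the file through gzip.open streams it line by line, so the dump does not have to be decompressed to disk first.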
If you use this dataset, please cite:
@inproceedings{pavlich2025beyond,
author = {Ryan Pavlich and Nima Ebadi and Richard Tarbell and Billy Linares and Adrian Tan and Rachael Humphreys and Jayanta Kumar Das and Rambod Ghandiparsi and Hannah Haley and Jerris George and Rocky Slavin and Kim-Kwang Raymond Choo and Glenn Dietrich and Anthony Rios},
title = {Beyond Text-to-SQL for IoT Defense: A Comprehensive Framework for Querying and Classifying IoT Threats},
booktitle = {TrustNLP: Fifth Workshop on Trustworthy Natural Language Processing},
year = {2025},
organization = {NAACL}
}
For questions or collaborations, contact Anthony Rios at Anthony.Rios@utsa.edu.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Worldwide Gender Differences in Public Code Contributions - Replication Package
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:
click==8.0.3
cycler==0.10.0
gender-guesser==0.4.0
kiwisolver==1.3.2
matplotlib==3.4.3
numpy==1.21.3
pandas==1.3.4
patsy==0.5.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.3
scipy==1.7.1
six==1.16.0
statsmodels==0.13.0
Initial data
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - ccTLD entities - world zones matches
Data preparation
Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst
sh> ./export.sh
Run the authors cleanup script to create authors--clean.csv.zst
sh> ./cleanup.sh authors.csv.zst
Filter out implausible names and create authors--plausible.csv.zst
sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Gender detection
Run the gender guessing script to create author-fullnames-gender.csv.zst
sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst
Database creation and data ingestion
Create the PostgreSQL DB
sh> createdb gender-commit
Note that from now on, commands prefixed with the psql> prompt are assumed to be run in psql on the gender-commit database.
Import data into PostgreSQL DB
sh> ./import_data.sh
Zone detection
Extract commits data from the DB and create commits.tab, which is used as input for the world zone detection script
sh> psql -f extract_commits.sql gender-commit
Run the world zone detection script to create commit_zones.tab.zst
sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
Read zones assignment data from the file into the DB
psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
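The shell filter inside the \copy command keeps the first and sixth tab-separated fields and drops rows whose result ends in whitespace, which in practice means rows where the sixth field is empty. The Python sketch below mimics that behaviour for readers less familiar with cut and grep. The function name select_zone_rows and the sample rows are hypothetical, and the sketch assumes every row has at least six fields (shorter rows are skipped here, whereas cut would pass them through truncated).

```python
import re

def select_zone_rows(lines):
    """Mimic `cut -f1,6 | grep -Ev '\\s$'` on tab-separated lines."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            # Assumption: rows with fewer than six fields are dropped.
            continue
        row = fields[0] + "\t" + fields[5]
        if re.search(r"\s$", row):
            # Sixth field empty (or ending in whitespace): no zone assigned.
            continue
        kept.append(row)
    return kept

# Made-up rows: an id, four filler fields, then the field kept as column six.
sample = [
    "c1\ta\tb\tc\td\tEurope\n",
    "c2\ta\tb\tc\td\t\n",
    "c3\ta\tb\tc\td\tAsia\textra\n",
]
result = select_zone_rows(sample)
print(result)
```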
Extraction and graphs
Run the script to execute the queries to extract the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...).
sh> ./extract_data.sh
Run the script to create the graphs from all the previously extracted tabfiles. This will generate commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf.
sh> ./create_charts.sh
Additional graphs
This package also includes some already-made graphs
authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period
authors_zones_2.pdf: ditto with at least two commits per period
authors_zones_10.pdf: ditto with at least ten commits per period
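The three pre-made graphs differ only in the minimum number of commits an author needs in a period to be counted. A small Python sketch of that thresholding follows. It is not the package's extraction code: the function name, the input tuple layout, the sample data, and the choice to compute the ratio as female / (female + male) are all assumptions made for illustration.

```python
from collections import defaultdict

def female_ratio_by_zone(authors, min_commits=1):
    """Ratio of female authors per zone for one period.

    authors: iterable of (zone, gender, commits_in_period) tuples.
    Authors below min_commits, or with a gender other than
    'female' or 'male', are ignored (an assumption of this sketch).
    """
    counts = defaultdict(lambda: {"female": 0, "total": 0})
    for zone, gender, commits in authors:
        if commits < min_commits or gender not in ("female", "male"):
            continue
        counts[zone]["total"] += 1
        if gender == "female":
            counts[zone]["female"] += 1
    return {zone: c["female"] / c["total"] for zone, c in counts.items()}

# Made-up authors for a single period.
period_authors = [
    ("Europe", "female", 1),
    ("Europe", "male", 12),
    ("Europe", "male", 3),
    ("Asia", "female", 10),
    ("Asia", "male", 1),
]

ratio_1 = female_ratio_by_zone(period_authors, min_commits=1)
ratio_2 = female_ratio_by_zone(period_authors, min_commits=2)
ratio_10 = female_ratio_by_zone(period_authors, min_commits=10)
print(ratio_1, ratio_2, ratio_10)
```

Raising the threshold shrinks the population being compared, so ratios from the three graphs describe increasingly active subsets of authors rather than the same group.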
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Definitions of incidence and prevalence terms.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present three defect rediscovery datasets mined from Bugzilla. The datasets capture data for three groups of open source software projects: Apache, Eclipse, and KDE. The datasets contain information about approximately 914 thousand defect reports over a period of 18 years (1999-2017) and capture the inter-relationships among duplicate defects.
File Descriptions
apache.csv - Apache Defect Rediscovery dataset
eclipse.csv - Eclipse Defect Rediscovery dataset
kde.csv - KDE Defect Rediscovery dataset
apache.relations.csv - Inter-relations of rediscovered defects of Apache
eclipse.relations.csv - Inter-relations of rediscovered defects of Eclipse
kde.relations.csv - Inter-relations of rediscovered defects of KDE
create_and_populate_neo4j_objects.cypher - Populates Neo4j graphDB by importing all the data from the CSV files. Note that you have to set the dbms.import.csv.legacy_quote_escaping configuration setting to false to load the CSV files, as per https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.import.csv.legacy_quote_escaping
create_and_populate_mysql_objects.sql - Populates MySQL RDBMS by importing all the data from the CSV files
rediscovery_db_mysql.zip - For your convenience, we also provide a full backup of the MySQL database
neo4j_examples.txt - Sample Neo4j queries
mysql_examples.txt - Sample MySQL queries
rediscovery_eclipse_6325.png - Output of Neo4j example #1
distinct_attrs.csv - Distinct values of bug_status, resolution, priority, severity for each project
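One way to explore inter-relationships among duplicate defects without a graph database is to group reports that are linked to each other, directly or transitively. The Python sketch below does this with a union-find structure. It is only an illustration: the column layout of the *.relations.csv files is not described here, so the input is assumed to be plain (bug_id, related_bug_id) pairs, and the ids shown are invented.

```python
def group_duplicates(pairs):
    """Group ids connected by duplicate relations (union-find)."""
    parent = {}

    def find(x):
        # Find the representative of x, halving the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Sort members and groups so the output is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

# Hypothetical relation pairs; real ids would come from a relations CSV.
relations = [(1, 2), (2, 3), (10, 11)]
clusters = group_duplicates(relations)
print(clusters)
```

In practice the pairs could be read with the csv module once the actual column names of the relations files are known.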
The data set contains an SQL dump of a PostgreSQL database. This database contains concentrations of hazardous substances and other water quality parameters in different environmental compartments:
river water (water and suspended sediments)
ground water
waste water (treated and untreated) and sewage sludge
storm water runoff from combined and separate sewer systems
atmospheric deposition
soil
Data from many different data sources were collected, checked and combined, and metadata were harmonized to allow for a combined data evaluation. The SQL file was exported from a PostgreSQL 15.2 database and compressed using 7zip into a zip file (dhm3cinventoryV2.zip). Text encoding is UTF-8. A short documentation (documentation_inventory_db_V2.0.pdf) and a list of known issues with the data which could not be resolved before publication (List_of_known_issues_V2.0.pdf) are enclosed as PDF files. Version 2.0.0 of the database contains more data, because a publication agreement was reached for further data sets and some data were reimported to resolve errors created during data preparation for import. The database structure was extended and corrected at several points, leading to an improved data model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Available functions in rEHR.