License: http://www.gnu.org/licenses/lgpl-3.0.html
On the official website the dataset is available through a SQL Server instance (localhost) and CSV files, to be used via Power BI Desktop running in the Virtual Lab (virtual machine). The first two steps of importing data were executed in the virtual lab, and the resulting Power BI tables were then copied into CSV files. Records were added up to the year 2022 as required.
This dataset is helpful if you want to work offline with Adventure Works data in Power BI Desktop and follow the lab instructions from the training material on the official website. It is also useful if you want to work through the Power BI Desktop Sales Analysis example from Microsoft's PL-300 learning path.
Download the CSV file(s) and import them into Power BI Desktop as tables. The CSVs are named after the tables created in the first two steps of importing data, as described in the PL-300 Microsoft Power BI Data Analyst exam lab.
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.
This is the list of manipulations performed on the original dataset, published by Möbius.
The entire cleaning process and all rearrangements were performed in BigQuery, using SQL functions.
1) After I took a closer look at the source dataset, I realized that for my case study, I did not need some of the tables contained in the original archive. Therefore, I decided not to import
- dailyCalories_merged.csv,
- dailyIntensities_merged.csv,
- dailySteps_merged.csv.
as they proved redundant: their content can be found in the dailyActivity_merged.csv file.
In addition, the files
- minutesCaloriesWide_merged.csv,
- minutesIntensitiesWide_merged.csv,
- minuteStepsWide_merged.csv.
were not imported, as they present the same data contained in other files, just in a wide format. Hence, only the long-format files containing the same data were imported into the BigQuery database.
2) To be able to compare and measure the correlation among different variables based on hourly records, I decided to create a new table using a LEFT JOIN on the Id and ActivityHour columns. I repeated the same JOIN on the tables with minute records. Hence I obtained 2 new tables:
- hourly_activity.csv,
- minute_activity.csv.
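A minimal BigQuery SQL sketch of this kind of join, assuming the hourly tables were loaded under names like hourly_calories, hourly_intensities, and hourly_steps in a dataset called fitbit (these names, and the non-key columns, are illustrative rather than the exact ones used in the case study):

```sql
-- Combine the three hourly tables into one row per (Id, ActivityHour).
-- Dataset, table, and column names other than Id/ActivityHour are assumptions.
CREATE TABLE fitbit.hourly_activity AS
SELECT
  c.Id,
  c.ActivityHour,
  c.Calories,
  i.TotalIntensity,
  s.StepTotal
FROM fitbit.hourly_calories AS c
LEFT JOIN fitbit.hourly_intensities AS i
  ON c.Id = i.Id AND c.ActivityHour = i.ActivityHour
LEFT JOIN fitbit.hourly_steps AS s
  ON c.Id = s.Id AND c.ActivityHour = s.ActivityHour;
```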
3) To validate and convert most of the columns containing DATE and DATETIME values that had been imported as the STRING data type, I used the PARSE_DATE() and PARSE_DATETIME() functions. While importing the
- heartrate_seconds_merged.csv,
- hourlyCalories_merged.csv,
- hourlyIntensities_merged.csv,
- hourlySteps_merged.csv,
- minutesCaloriesNarrow_merged.csv,
- minuteIntensitiesNarrow_merged.csv,
- minuteMETsNarrow_merged.csv,
- minuteSleep_merged.csv,
- minuteSteps_merged.csv,
- sleepDay_merge.csv,
- weigthLog_Info_merged.csv
files to BigQuery, it was necessary to import the DATETIME and DATE type columns as STRING, because the original syntax used in the CSV files could not be recognized as a valid DATETIME data type, due to the "AM" and "PM" text at the end of the expression.
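For example, a BigQuery expression like the following can convert such a string column; the exact format string is an assumption about how the timestamps look in the CSVs (e.g. "4/12/2016 1:00:00 AM"):

```sql
-- Parse a 12-hour AM/PM timestamp stored as STRING into a DATETIME.
-- Table and column names, and the format string, are illustrative.
SELECT
  Id,
  PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', ActivityHour) AS ActivityHourParsed
FROM fitbit.hourly_calories;
```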
analyze the health and retirement study (hrs) with r

the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original.

figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you.

the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.
this new github repository contains five scripts:

1992 - 2010 download HRS microdata.R
- loop through every year and every file, download, then unzip everything in one big party

import longitudinal RAND contributed files.R
- create a SQLite database (.db) on the local disk
- load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)

longitudinal RAND - analysis examples.R
- connect to the sql database created by the 'import longitudinal RAND contributed files' program
- create two database-backed complex sample survey objects, using a taylor-series linearization design
- perform a mountain of analysis examples with wave weights from two different points in the panel

import example HRS file.R
- load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html)
- parse through the IF block at the bottom of the sas importation script, blank out a number of variables
- save the file as an R data file (.rda) for fast loading later

replicate 2002 regression.R
- connect to the sql database created by the 'import longitudinal RAND contributed files' program
- create a database-backed complex sample survey object, using a taylor-series linearization design
- exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

click here to view these five scripts

for more detail about the health and retirement study (hrs), visit:
- michigan's hrs homepage
- rand's hrs homepage
- the hrs wikipedia page
- a running list of publications using hrs

notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself.

confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
License: http://opendatacommons.org/licenses/dbcl/1.0/
Objective:
Improve understanding of real estate performance.
Leverage data to support business decisions.
Scope:
Track property sales, visits, and performance metrics.
Step 1: Creating an Azure SQL Database
Action: Provisioned an Azure SQL Database to host real estate data.
Why Azure?: Scalability, security, and integration with Power BI.
Step 2: Importing Data
Action: Imported datasets (properties, visits, sales, agents, etc.) into the SQL database.
Tools Used: SQL Server Management Studio (SSMS) and Azure Data Studio.
Step 3: Data Transformation in SQL
Normalized Data: Ensured data consistency by normalizing the formats of dates and categorical fields.
Calculated Fields:
Time on Market: Used the DATEDIFF function to calculate the difference between listing and sale dates (see the SQL sketch after this step).
Conversion Rate: Aggregated sales and visits data using COUNT and SUM to calculate conversion rates per agent and property.
Buyer Segmentation: Identified first-time vs. repeat buyers using JOINs and COUNT functions.
Data Cleaning: Removed duplicates, handled null values, and standardized city names and property types.
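A minimal T-SQL sketch of the Time on Market and Conversion Rate calculations; the table and column names (Properties, Sales, Visits, ListingDate, SaleDate, AgentID, and so on) are assumptions for illustration rather than the project's actual schema:

```sql
-- Time on market: days between listing and sale, per sold property.
SELECT
    p.PropertyID,
    DATEDIFF(day, p.ListingDate, s.SaleDate) AS TimeOnMarketDays
FROM Properties AS p
JOIN Sales AS s
    ON s.PropertyID = p.PropertyID;

-- Conversion rate per agent: distinct sales divided by distinct visits.
SELECT
    v.AgentID,
    COUNT(DISTINCT s.SaleID) * 1.0 / COUNT(DISTINCT v.VisitID) AS ConversionRate
FROM Visits AS v
LEFT JOIN Sales AS s
    ON s.PropertyID = v.PropertyID
   AND s.AgentID = v.AgentID
GROUP BY v.AgentID;
```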
Step 4: Connecting Power BI to Azure SQL
Action: Established a live connection to Azure SQL Database in Power BI.
Benefit: Real-time data updates and efficient analysis.
Step 5: Data Modeling in Power BI
Relationships:
Defined relationships between tables (e.g., Sales, Visits, Properties, Agents) using primary and foreign keys.
Utilized active and inactive relationships for dynamic calculations like time-based comparisons.
Calculated Columns and Measures:
Time on Market: Created a calculated measure using DATEDIFF.
Conversion Rates: Used DIVIDE and CALCULATE for accurate per-agent and per-property analysis.
Step 6: Creating Visualizations
Key Visuals:
Sales Heatmap by City: Geographic visualization to highlight sales performance.
Conversion Rates: Bar charts and line graphs for trend analysis.
Time on Market: Boxplots and histograms for distribution insights.
Buyer Segmentation: Pie charts and bar graphs to show buyer profiles.
Step 7: Building Dashboards
Page 1: Overview (Key Metrics and Sales Heatmap).
Page 2: Performance Analysis (Conversion Rates, Time on Market).
Page 3: Buyer Insights (First-Time vs Repeat Buyers, Property Distribution).
Insight 1: Sales Performance by City
Certain cities account for the highest sales volume.
Other cities show low performance, requiring further investigation.
Insight 2: Conversion Rates
Identified the agent with the highest conversion rate.
Certain properties (e.g., luxury villas) outperform others in conversion.
Insight 3: Time on Market
Average time on market.
Insight 4: Buyer Trends
Repeat Buyers make up 60% of purchases.
First-Time Buyers prefer apartments over villas.
Recommendation 1: Focus on High-Performing Cities
Recommendation 2: Support Low-Performing Areas
Investigate challenges to develop targeted marketing strategies.
Recommendation 3: Enhance Conversion Rates
Train agents based on techniques used by top performers.
Prioritize marketing for properties with high conversion rates.
Recommendation 4: Engage First-Time Buyers
Create specific campaigns for apartments to attract first-time buyers.
Offer financial guidance programs to boost their confidence.
Summary:
Built a robust data solution from Azure SQL to Power BI.
Derived actionable insights that can drive real estate growth.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
This dataset contains software metric and design pattern data for around 100,000 projects from the Maven Central repository. The data was collected and analyzed as part of my master's thesis "Mining Software Repositories for the Effects of Design Patterns on Software Quality" (https://www.overleaf.com/read/vnfhydqxmpvx, https://zenodo.org/record/4048275).
The included qualisign.* files all contain the same data in different formats:
- qualisign.sql: standard SQL format (exported using "pg_dump --inserts ..."),
- qualisign.psql: PostgreSQL plain format (exported using "pg_dump -Fp ..."),
- qualisign.csql: PostgreSQL custom format (exported using "pg_dump -Fc ...").
create-tables.sql has to be executed before importing one of the qualisign.* files. Once one of the qualisign.* files has been imported, create-views.sql can be executed to preprocess the data, thereby creating materialized views that are more appropriate for data analysis purposes.
Software metrics were calculated using CKJM extended: http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/
Included software metrics are (21 total):
- AMC: Average Method Complexity
- CA: Afferent Coupling
- CAM: Cohesion Among Methods
- CBM: Coupling Between Methods
- CBO: Coupling Between Objects
- CC: Cyclomatic Complexity
- CE: Efferent Coupling
- DAM: Data Access Metric
- DIT: Depth of Inheritance Tree
- IC: Inheritance Coupling
- LCOM: Lack of Cohesion of Methods (Chidamber and Kemerer)
- LCOM3: Lack of Cohesion of Methods (Constantine and Graham)
- LOC: Lines of Code
- MFA: Measure of Functional Abstraction
- MOA: Measure of Aggregation
- NOC: Number of Children
- NOM: Number of Methods
- NOP: Number of Polymorphic Methods
- NPM: Number of Public Methods
- RFC: Response for Class
- WMC: Weighted Methods per Class
In the qualisign.* data, these metrics are only available on the class level. create-views.sql additionally provides averages of these metrics on the package and project levels.
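As an illustration of the kind of aggregation create-views.sql performs, a package-level rollup could look like the following; the table and column names here are assumptions, since the actual schema is defined by create-tables.sql:

```sql
-- Hypothetical package-level rollup of class-level metrics.
-- Real table/column names are defined in create-tables.sql and may differ.
CREATE MATERIALIZED VIEW package_metrics AS
SELECT
    package_id,
    AVG(wmc) AS avg_wmc,
    AVG(cbo) AS avg_cbo,
    AVG(dit) AS avg_dit,
    AVG(loc) AS avg_loc
FROM classes
GROUP BY package_id;
```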
Design patterns were detected using SSA: https://users.encs.concordia.ca/~nikolaos/pattern_detection.html
Included design patterns are (15 total):
- Adapter
- Bridge
- Chain of Responsibility
- Command
- Composite
- Decorator
- Factory Method
- Observer
- Prototype
- Proxy
- Singleton
- State
- Strategy
- Template Method
- Visitor
The code to generate the dataset is available at: https://github.com/jaichberg/qualisign
The code to perform quality analysis on the dataset is available at: https://github.com/jaichberg/qualisign-analysis
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
We present three defect rediscovery datasets mined from Bugzilla. The datasets capture data for three groups of open source software projects: Apache, Eclipse, and KDE. The datasets contain information about approximately 914 thousand defect reports over a period of 18 years (1999-2017), capturing the inter-relationships among duplicate defects.
File Descriptions
apache.csv - Apache Defect Rediscovery dataset
eclipse.csv - Eclipse Defect Rediscovery dataset
kde.csv - KDE Defect Rediscovery dataset
apache.relations.csv - Inter-relations of rediscovered defects of Apache
eclipse.relations.csv - Inter-relations of rediscovered defects of Eclipse
kde.relations.csv - Inter-relations of rediscovered defects of KDE
create_and_populate_neo4j_objects.cypher - Populates the Neo4j graph DB by importing all the data from the CSV files. Note that you have to set the dbms.import.csv.legacy_quote_escaping configuration setting to false to load the CSV files; see https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.import.csv.legacy_quote_escaping
create_and_populate_mysql_objects.sql - Populates the MySQL RDBMS by importing all the data from the CSV files (a generic example of this kind of import is sketched after this file list)
rediscovery_db_mysql.zip - For your convenience, we also provide full backup of the MySQL database
neo4j_examples.txt - Sample Neo4j queries
mysql_examples.txt - Sample MySQL queries
rediscovery_eclipse_6325.png - Output of Neo4j example #1
distinct_attrs.csv - Distinct values of bug_status, resolution, priority, severity for each project
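A generic example of the kind of CSV import that create_and_populate_mysql_objects.sql performs; the real table definitions and column mappings live in that script, so the names below are placeholders:

```sql
-- Load one of the CSV files into a MySQL table.
-- Table name is a placeholder; see create_and_populate_mysql_objects.sql
-- for the actual DDL and column mapping. Add IGNORE 1 LINES if the CSV
-- contains a header row.
LOAD DATA LOCAL INFILE 'apache.csv'
INTO TABLE apache_defects
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
```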
License: https://www.usa.gov/government-works/
This data set was retrieved from the TranStats webpage of the Bureau of Transportation Statistics of the US Department of Transportation. The data was cleaned and made ready for use in a university project whose goal was to compare different database engines in terms of performance, among other aspects.
NOTE: December 2020 was not included in the data set since it was not made available by the BTS as of today, 18th Feb 2021.
The data is split into CSV files for each year 2015 to 2020. flights.csv contains all the data in one file. The CSV files contain no headers for the columns. The headers are as follows:
'YEAR', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'OP_UNIQUE_CARRIER', 'ORIGIN_CITY_NAME',
'ORIGIN_STATE_ABR', 'DEST_CITY_NAME', 'DEST_STATE_ABR', 'CRS_DEP_TIME', 'DEP_DELAY_NEW',
'CRS_ARR_TIME', 'ARR_DELAY_NEW', 'CANCELLED', 'CANCELLATION_CODE', 'AIR_TIME', 'DISTANCE'
NOTE: The headers were removed to make it easy to import the data into a SQL database (one possible staging schema and load is sketched below).
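One possible way to stage the data in SQL; the column types below are assumptions inferred from the column names, not taken from the original project:

```sql
-- Possible table definition for flights.csv (types are assumptions).
CREATE TABLE flights (
    YEAR              INT,
    MONTH             INT,
    DAY_OF_MONTH      INT,
    DAY_OF_WEEK       INT,
    OP_UNIQUE_CARRIER VARCHAR(10),
    ORIGIN_CITY_NAME  VARCHAR(100),
    ORIGIN_STATE_ABR  CHAR(2),
    DEST_CITY_NAME    VARCHAR(100),
    DEST_STATE_ABR    CHAR(2),
    CRS_DEP_TIME      INT,
    DEP_DELAY_NEW     DECIMAL(8,2),
    CRS_ARR_TIME      INT,
    ARR_DELAY_NEW     DECIMAL(8,2),
    CANCELLED         DECIMAL(3,2),
    CANCELLATION_CODE CHAR(1),
    AIR_TIME          DECIMAL(8,2),
    DISTANCE          DECIMAL(8,2)
);

-- MySQL-style bulk load; no header row needs to be skipped since the
-- headers were removed from the CSV files.
LOAD DATA LOCAL INFILE 'flights.csv'
INTO TABLE flights
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```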
If you have any questions about how I retrieved/cleaned the data or anything about my project, feel free to check out my Github repository or shoot me a message.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
We introduce a self-describing serialized format for bulk biomedical data called the Portable Format for Biomedical (PFB) data. The Portable Format for Biomedical data is based upon Avro and encapsulates a data model, a data dictionary, the data itself, and pointers to third party controlled vocabularies. In general, each data element in the data dictionary is associated with a third party controlled vocabulary to make it easier for applications to harmonize two or more PFB files. We also introduce an open source software development kit (SDK) called PyPFB for creating, exploring and modifying PFB files. We describe experimental studies showing the performance improvements when importing and exporting bulk biomedical data in the PFB format versus using JSON and SQL formats.
License: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
This dataset contains information on international student mobility and economic indicators for multiple countries and territories during 2020-2023. In addition, it includes data on Importing Countries and Exporting Countries, which refer to countries that receive (import) or send (export) international students.
The data was obtained from official open-data sources, mainly the UNESCO Institute for Statistics (UIS) and the World Bank open data portal. The original files were downloaded as CSV directly from these platforms. After downloading, the datasets were merged, filtered, and cleaned using SQL and Excel (for example, removing duplicates and selecting specific years such as 2023 and 2024). No web scraping was used; all data comes from publicly available official databases.
Conclusions / Insights
Countries with negative average net flows, such as China and India, are net exporters of students, meaning they send more students abroad than they receive.
Countries with positive average net flows, such as the United States and other high-income countries, are net importers of students, attracting more international students than they send.
There appears to be a relationship between GDP per capita (PPP, 2023 USD) and student mobility patterns: countries with higher PPP tend to attract more international students, while countries with lower PPP tend to send more students abroad.
This dataset can be used to analyze trends in international student mobility, compare countries’ economic contexts, and identify patterns between student flows and national wealth.
The classification of countries as Importing or Exporting provides a quick way to group and compare countries in terms of international student dynamics (a minimal SQL sketch of this grouping is shown after these conclusions).
Practical application: Universities and educational institutions can use this data to better understand potential target markets, identify countries that send or receive more students, and develop strategies for recruitment and international collaborations.
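A minimal SQL sketch of that grouping, assuming a table with one row per country and an average net flow column (the table and column names are illustrative):

```sql
-- Classify countries as net importers or exporters of students
-- based on the sign of their average net flow.
SELECT
    country_name,
    avg_net_flow,
    CASE
        WHEN avg_net_flow > 0 THEN 'Importing'
        WHEN avg_net_flow < 0 THEN 'Exporting'
        ELSE 'Balanced'
    END AS mobility_role
FROM student_mobility
ORDER BY avg_net_flow DESC;
```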
License: https://creativecommons.org/publicdomain/zero/1.0/
Data was imported from the BAK file found here into SQL Server, and then individual tables were exported as CSVs. The Jupyter Notebook containing the code used to clean the data can be found here.
Version 6 includes some additional cleaning and structuring of issues noticed after importing the data into Power BI. Changes were made by adding code to the Python notebook to export a newly cleaned dataset, such as adding a MonthNumber column for sorting by month and, similarly, a WeekDayNumber column (a rough T-SQL analogue is sketched below).
Cleaning was done in Python, while SQL Server was also used to quickly inspect the data. Headers were added separately, ensuring no data loss. The data was cleaned for NaN and garbage values across the columns.
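The cleaning itself was done in the Python notebook; as a rough T-SQL analogue of the MonthNumber/WeekDayNumber derivation, assuming the standard AdventureWorks Sales.SalesOrderHeader table and its OrderDate column:

```sql
-- Derive month and weekday sort keys from a date column.
-- The weekday number returned by DATEPART depends on the DATEFIRST setting.
SELECT
    OrderDate,
    DATENAME(month, OrderDate)   AS MonthName,
    MONTH(OrderDate)             AS MonthNumber,
    DATENAME(weekday, OrderDate) AS WeekDayName,
    DATEPART(weekday, OrderDate) AS WeekDayNumber
FROM Sales.SalesOrderHeader;
```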
This dataset consists of basic statistics and career statistics provided by the NFL on their official website (http://www.nfl.com) for all players, active and retired.
All of the data was web scraped using Python code, which can be found and downloaded here: https://github.com/ytrevor81/NFL-Stats-Web-Scrape
Before we go into the specifics, it's important to note that in the basic statistics and career statistics CSV files, all players are assigned a 'Player_Id'. This is the same ID used by the official NFL website to identify each player. This is useful if, for example, you import these CSV files into a SQL database for an app.
The data pulled for each player in Active_Player_Basic_Stats.csv is as follows:
a. Player ID
b. Full Name
c. Position
d. Number
e. Current Team
f. Height
g. Height
h. Weight
i. Experience
j. Age
k. College
The data pulled for each player in Retired_Player_Basic_Stats.csv differs slightly from the previous data set. The data is as follows:
a. Player ID
b. Full Name
c. Position
f. Height
g. Height
h. Weight
j. College
k. Hall of Fame Status