6 datasets found

MCB_languages_county
kaggle.com
Updated Oct 1, 2019
Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county/code
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Marisol Brewster
Description
Context

This is a dataset I found online through the Google Dataset Search portal.

Content

The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

Acknowledgements

Sources:

Google Dataset Search: https://toolbox.google.com/datasetsearch

2009-2013 American Community Survey

Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

Downloaded From: https://data.world/kvaughn/languages-county

Banner and thumbnail photo by Farzad Mohsenvand on Unsplash
Financial Statement Data Sets
kaggle.com
Updated Jul 4, 2025
Vadim Vanak (2025). Financial Statement Data Sets [Dataset]. https://www.kaggle.com/datasets/vadimvanak/company-facts-2
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Vadim Vanak
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset offers a detailed collection of US-GAAP financial data extracted from the financial statements of exchange-listed U.S. companies, as submitted to the U.S. Securities and Exchange Commission (SEC) via the EDGAR database. Covering filings from January 2009 onwards, this dataset provides key financial figures reported by companies in accordance with U.S. Generally Accepted Accounting Principles (GAAP).

Dataset Features:

Data Scope: The dataset is restricted to figures reported under US-GAAP standards, with the exception of EntityCommonStockSharesOutstanding and EntityPublicFloat.

Currency and Units: The dataset exclusively includes figures reported in USD or shares, ensuring uniformity and comparability. It excludes ratios and non-financial metrics to maintain focus on financial data.

Company Selection: The dataset is limited to companies with U.S. exchange tickers, providing a concentrated analysis of publicly traded firms within the United States.

Submission Types: The dataset only incorporates data from 10-Q, 10-K, 10-Q/A, and 10-K/A filings, ensuring consistency in the type of financial reports analyzed.
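
The scope rules above amount to a row filter. The following is a minimal sketch of that idea, not the dataset author's extraction code: the field names (`ticker`, `form`, `taxonomy`, `tag`, `unit`) and the sample rows are hypothetical and do not reflect the dataset's actual schema.

```python
from typing import Dict, Iterable, List

# Filing types the description says are included.
ALLOWED_FORMS = {"10-Q", "10-K", "10-Q/A", "10-K/A"}
# Only figures denominated in USD or shares are kept.
ALLOWED_UNITS = {"USD", "shares"}
# The two tags the description names as exceptions to the US-GAAP rule.
NON_GAAP_EXCEPTIONS = {"EntityCommonStockSharesOutstanding", "EntityPublicFloat"}


def keep_fact(fact: Dict[str, str]) -> bool:
    """Return True if a fact passes the scope rules (hypothetical field names)."""
    if not fact.get("ticker"):  # company must have a U.S. exchange ticker
        return False
    if fact["form"] not in ALLOWED_FORMS:
        return False
    if fact["unit"] not in ALLOWED_UNITS:
        return False
    return fact["taxonomy"] == "us-gaap" or fact["tag"] in NON_GAAP_EXCEPTIONS


def filter_facts(facts: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the facts that satisfy every scope rule."""
    return [fact for fact in facts if keep_fact(fact)]


# Made-up rows for illustration only.
SAMPLE_FACTS = [
    {"ticker": "AAA", "form": "10-K", "taxonomy": "us-gaap", "tag": "Revenues", "unit": "USD"},
    {"ticker": "AAA", "form": "8-K", "taxonomy": "us-gaap", "tag": "Revenues", "unit": "USD"},
    {"ticker": "AAA", "form": "10-Q", "taxonomy": "dei", "tag": "EntityPublicFloat", "unit": "USD"},
    {"ticker": "AAA", "form": "10-Q", "taxonomy": "us-gaap", "tag": "SomeRatio", "unit": "pure"},
    {"ticker": "", "form": "10-K", "taxonomy": "us-gaap", "tag": "Assets", "unit": "USD"},
]

kept = filter_facts(SAMPLE_FACTS)
print([fact["tag"] for fact in kept])
```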

Data Sources and Extraction:

This dataset primarily relies on the SEC's Financial Statement Data Sets and EDGAR APIs:
- SEC Financial Statement Data Sets
- EDGAR Application Programming Interfaces

In instances where specific figures were missing from these sources, data was directly extracted from the companies' financial statements to ensure completeness.

Please note that the dataset presents financial figures exactly as reported by the companies, which may occasionally include errors. A common issue involves incorrect reporting of scaling factors in the XBRL format. XBRL supports two tag attributes related to scaling: 'decimals' and 'scale.' The 'decimals' attribute indicates the number of significant decimal places but does not affect the actual value of the figure, while the 'scale' attribute adjusts the value by a specific factor.

In thousands of instances, however, companies have used the 'decimals' attribute (e.g., 'decimals="-6"') under the mistaken assumption that it controls scaling. It does not, so some figures may be inaccurately scaled. This dataset does not attempt to detect or correct such errors; it aims to reflect the data precisely as reported by the companies. A future version of the dataset may be introduced to address and correct these issues.
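
To make the distinction concrete, here is a small sketch of how the two attributes behave, assuming the usual reading that 'scale' multiplies the displayed number by a power of ten while 'decimals' only states accuracy. The helper names and the 12.5 million figure are invented for illustration.

```python
from decimal import Decimal


def fact_value(displayed: str, scale: int = 0) -> Decimal:
    """Value implied by the displayed number and a 'scale' exponent."""
    return Decimal(displayed) * (Decimal(10) ** scale)


def accuracy_unit(decimals: int) -> Decimal:
    """Smallest unit a figure is accurate to; it never changes the value."""
    return Decimal(10) ** (-decimals)


# A filer means 12.5 million USD and scales it correctly.
correct = fact_value("12.5", scale=6)

# A filer writes 12.5 with decimals="-6" and no scale, expecting millions.
# The 'decimals' attribute is ignored when computing the value.
mistaken = fact_value("12.5")

print(correct, mistaken, accuracy_unit(-6))
```

In the mistaken case the reported value stays at 12.5, which is the kind of figure the dataset passes through unchanged.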

The source code for data extraction is available here
Spanish Stocks Historical Data from 2000 to 2019
kaggle.com
Updated Jun 7, 2019
alvarobartt (2019). Spanish Stocks Historical Data from 2000 to 2019 [Dataset]. https://www.kaggle.com/alvarob96/spanish-stocks-historical-data/code
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
alvarobartt
Description
Introduction

Since Investing.com does not have an API, I decided to develop this Python package to retrieve historical data for the companies that make up the Continuous Spanish Stock Market. I then used investpy to generate a dataset for every company, so that any data scientist or data enthusiast can work with them and draw their own conclusions.

The main purpose of developing investpy, the package from which these datasets have been retrieved, was to serve as the data extraction tool for the section of that name in my Final Degree Project at the University of Salamanca, titled "Machine Learning for stock investment recommendation systems". The package ended up being so consistent, reliable and usable that it is going to be used as the main data extraction tool by other students in their Final Degree Projects, "Recommender system of banking products" and "Robo-Advisor Application".

License

MIT License

Additional Information

investpy, the Python package from which the datasets were generated, is currently in a development beta version. If you run into a problem with the package or an error in a dataset, please open an issue. New ideas and proposals are also welcome and will be gladly implemented in the package if they are positive and useful.

For further information or any question feel free to contact me via email at alvarob96@usal.es

You can also check my Medium Publication, where I upload weekly posts related to Data Science and mainly on Data Extraction techniques via Web Scraping. In this case, you can read "investpy — a Python package for historical data extraction from the Spanish stock market" where I explain the basics on investpy development and some insights on Web Scraping with Python.

Disclaimer

This Python package was made for research purposes, to meet a need that Investing.com does not cover, so it works like an Application Programming Interface (API) for Investing.com, developed altruistically. Finally, note that this package is not related in any way to Investing.com or any dependent company; the only requirement for developing it was to mention the source from which the data is retrieved.
NVidia - Stock Data - Latest and Updated
kaggle.com
Updated Feb 2, 2025
Kalilur Rahman (2025). NVidia - Stock Data - Latest and Updated [Dataset]. https://www.kaggle.com/datasets/kalilurrahman/nvidia-stock-data-latest-and-updated/versions/167
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Kalilur Rahman
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Nvidia Corporation is an American multinational technology company incorporated in Delaware and based in Santa Clara, California.

It designs graphics processing units (GPUs) for the gaming and professional markets, as well as system on a chip units (SoCs) for the mobile computing and automotive market.

Its primary GPU line, labeled "GeForce", is in direct competition with the GPUs of the "Radeon" brand by Advanced Micro Devices (AMD). Nvidia expanded its presence in the gaming industry with its handheld game consoles Shield Portable, Shield Tablet, and Shield Android TV and its cloud gaming service GeForce Now.

Its professional line of GPUs is used in workstations for applications in such fields as architecture, engineering and construction, media and entertainment, automotive, scientific research, and manufacturing design.

In addition to GPU manufacturing, Nvidia provides an application programming interface (API) called CUDA that allows the creation of massively parallel programs which utilize GPUs. They are deployed in supercomputing sites around the world. More recently, it has moved into the mobile computing market, where it produces Tegra mobile processors for smartphones and tablets as well as vehicle navigation and entertainment systems. In 2020 it announced a plan to acquire Arm, but the deal was abandoned in 2022.

Let us analyze the performance of this solid star!

[CIC-AndMal-2020] Static-Dynamic Malware analysis

kaggle.com

Updated Dec 27, 2021

Alberto Zorzetto (2021). [CIC-AndMal-2020] Static-Dynamic Malware analysis [Dataset]. https://www.kaggle.com/datasets/albertozorzetto/cic-andmal-2020-dynamic-static-analysis

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Alberto Zorzetto

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Introduction

This dataset contains 200K Android malware apps, each labeled and assigned to its corresponding family. Benign Android apps (200K) are collected from the Androzoo dataset to balance the dataset. We collected 14 malware categories: adware, backdoor, file infector, no category, Potentially Unwanted Apps (PUA), ransomware, riskware, scareware, trojan, trojan-banker, trojan-dropper, trojan-sms, trojan-spy and zero-day.

A complete taxonomy of all the malware families of the captured malware apps is created by dividing them into 8 categories: sensitive data collection, media, hardware, actions/activities, internet connection, C&C, antivirus and storage & settings.

Dataset details

Category	Number of families	Number of samples
Adware	48	47,210
Backdoor	11	1,538
File Infector	5	669
No Category	-	2,296
PUA	8	2,051
Ransomware	8	6,202
Riskware	21	97,349
Scareware	3	1,556
Trojan	45	13,559
Trojan-Banker	11	887
Trojan-Dropper	9	2,302
Trojan-SMS	11	3,125
Trojan-Spy	11	3,540
Zero-day	-	13,340
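
As a quick check on class balance, the table above can be parsed and summarized. This sketch simply embeds the rows as tab-separated text; it does not read any file from the dataset itself.

```python
import csv
import io

# The table above, copied as tab-separated text.
TABLE = (
    "Category\tNumber of families\tNumber of samples\n"
    "Adware\t48\t47,210\n"
    "Backdoor\t11\t1,538\n"
    "File Infector\t5\t669\n"
    "No Category\t-\t2,296\n"
    "PUA\t8\t2,051\n"
    "Ransomware\t8\t6,202\n"
    "Riskware\t21\t97,349\n"
    "Scareware\t3\t1,556\n"
    "Trojan\t45\t13,559\n"
    "Trojan-Banker\t11\t887\n"
    "Trojan-Dropper\t9\t2,302\n"
    "Trojan-SMS\t11\t3,125\n"
    "Trojan-Spy\t11\t3,540\n"
    "Zero-day\t-\t13,340\n"
)


def load_counts(text: str) -> dict:
    """Map each category to its sample count, stripping thousands separators."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row["Category"]: int(row["Number of samples"].replace(",", ""))
        for row in reader
    }


counts = load_counts(TABLE)
total = sum(counts.values())

# Print categories from largest to smallest with their share of all samples.
for category, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:15s} {n:>7,d} {n / total:6.1%}")
```

Summing the rows gives 195,624 samples, a little under the rounded 200K figure in the introduction, and Riskware alone accounts for roughly half of them.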

Static analysis

AndroidManifest.xml contains a lot of features that can be used for static analysis. The main extracted features include:

Activities: An Android activity is one screen of the app's user interface
Broadcast receivers and providers
Metadata: An additional place to store information that can be accessed throughout the entire project
Permissions requested by the application: These protect the user's privacy and are needed to access sensitive user data (such as contacts and SMS)
System features (such as camera and internet)

Static Features

Feature	Values
Package Name	"com.fb.iwidget"
Activities	"com.fb.iwidget.OverlayActivity" "org.acra.CrashReportDialog" "com.batch.android.BatchActionActivity" "com.fb.iwidget.MainActivity" "com.fb.iwidget.PreferencesActivity" "com.fb.iwidget.PickerActivity" "com.fb.iwidget.IntroActivity"
Services	"com.batch.android.BatchActionService" "com.fb.iwidget.MainService" "com.fb.iwidget.SnapAccessService"
Receivers/Providers	"com.fb.iwidget.ExpandWidgetProvider" "com.fb.iwidget.ActionReceiver"
Intents Actions	"android.accessibilityservice.AccessibilityService" "android.appwidget.action.APPWIDGET_UPDATE" "android.intent.action.BOOT_COMPLETED" "android.intent.action.CREATE_SHORTCUT" "android.intent.action.MAIN" "android.intent.action.MY_PACKAGE_REPLACED" "android.intent.action.USER_PRESENT" "android.intent.action.VIEW" "com.fb.iwidget.action.SHOULD_REVIVE"
Intents Categories	"android.intent.category.BROWSABLE" "android.intent.category.DEFAULT" "android.intent.category.LAUNCHER"
Permissions	"android.permission.ACCESS_NETWORK_STATE" "android.permission.CALL_PHONE" "android.permission.INTERNET" "android.permission.RECEIVE_BOOT_COMPLETED" "android.permission.SYSTEM_ALERT_WINDOW" "com.android.vending.BILLING" "android.permission.BIND_ACCESSIBILITY_SERVICE"
Meta-Data	"android.accessibilityservice" "android.appwidget.provider"
#Icons	331
#Pictures	0
#Videos	0
Audio files	0
Videos	0
Size of the App	4.2M
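
Features like those in the table can be read from a decoded AndroidManifest.xml with a standard XML parser. This is a minimal sketch under two assumptions: the manifest has already been converted to plain-text XML (inside an APK it is stored in a binary form), and the tiny sample manifest is hand-written from names in the table, not a real file from the dataset.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"  # fully qualified android:name attribute

# Hand-written sample manifest for illustration only.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.fb.iwidget">
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.CALL_PHONE" />
    <application>
        <activity android:name="com.fb.iwidget.MainActivity">
            <intent-filter>
                <action android:name="android.intent.action.MAIN" />
                <category android:name="android.intent.category.LAUNCHER" />
            </intent-filter>
        </activity>
        <service android:name="com.fb.iwidget.MainService" />
        <receiver android:name="com.fb.iwidget.ActionReceiver" />
    </application>
</manifest>
"""


def names(root: ET.Element, tag: str) -> list:
    """Collect android:name values from every element with the given tag."""
    return [el.get(NAME_ATTR) for el in root.iter(tag) if el.get(NAME_ATTR)]


def extract_static_features(manifest_xml: str) -> dict:
    """Pull a few of the static features listed above out of a manifest."""
    root = ET.fromstring(manifest_xml.strip())
    return {
        "package": root.get("package"),
        "activities": names(root, "activity"),
        "services": names(root, "service"),
        "receivers_providers": names(root, "receiver") + names(root, "provider"),
        "intent_actions": names(root, "action"),
        "intent_categories": names(root, "category"),
        "permissions": names(root, "uses-permission"),
        "meta_data": names(root, "meta-data"),
    }


features = extract_static_features(SAMPLE_MANIFEST)
for key, value in features.items():
    print(key, value)
```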

Dynamic analysis

For understanding the behavioral changes of these malware categories and families, six categories of features are extracted after executing the malware in an emulated environment. The main extracted features include:

Memory: Memory features define activities performed by malware by utilizing memory.
API: Application Programming Interface (API) features delineate the communication between two applications.
Network: Network features describe the data transmitted and received between other devices in the network. They indicate foreground and background network usage.
Battery: Batt...

llama-7b.ggmlv3.q4_1.bin
kaggle.com
Updated Oct 5, 2023
Lorentz (2023). llama-7b.ggmlv3.q4_1.bin [Dataset]. https://www.kaggle.com/datasets/lorentzyeung/llama-7b-ggmlv3-q4-1-bin
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Lorentz
Description
Git LFS Details

Origin: https://huggingface.co/TheBloke/LLaMa-7B-GGML
SHA256: bcb95f6755597f26046ab2d5ebea51bf1418f440a96e1563f0fecc379c2cbee3
Pointer size: 135 Bytes
Size of remote file: 3.79 GB

Git Large File Storage (LFS) replaces large files with text pointers inside Git, while storing the file contents on a remote server.
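
Because the listing publishes a SHA256 for the remote file, a downloaded copy can be checked against it. The sketch below hashes a file in chunks so a multi-gigabyte model does not need to fit in memory; the helper names and the local path in the usage comment are assumptions.

```python
import hashlib

# Digest published in the Git LFS details above.
EXPECTED_SHA256 = "bcb95f6755597f26046ab2d5ebea51bf1418f440a96e1563f0fecc379c2cbee3"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 of a file, reading it one chunk at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()


# Usage (hypothetical local path):
# print(matches_pointer("llama-7b.ggmlv3.q4_1.bin"))
```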

Meta's LLaMA 7b GGML

These files are GGML format model files for Meta's LLaMA 7b.

GGML files are for CPU + GPU inference using llama.cpp and libraries and UIs which support this format, such as:

KoboldCpp, a powerful GGML web UI with full GPU acceleration out of the box. Especially good for story telling.
LoLLMS Web UI, a great web UI with GPU acceleration via the c_transformers backend.
LM Studio, a fully featured local GUI. Supports full GPU accel on macOS. Also supports Windows, without GPU accel.
text-generation-webui, the most popular web UI. Requires extra steps to enable GPU accel via llama.cpp backend.
ctransformers, a Python library with LangChain support and OpenAI-compatible AI server.
llama-cpp-python, a Python library with OpenAI-compatible API server.

These files were quantised using hardware kindly provided by Latitude.sh.

Repositories available

GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference
Unquantised fp16 model in pytorch format, for GPU inference and for further conversions: https://huggingface.co/huggyllama/llama-7b

Prompt template: None

Compatibility

Original llama.cpp quant methods: q4_0, q4_1, q5_0, q5_1, q8_0

These are guaranteed to be compatible with any UIs, tools and libraries released since late May. They may be phased out soon, as they are largely superseded by the new k-quant methods.

New k-quant methods: q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K

These new quantisation methods are compatible with llama.cpp as of June 6th, commit 2d43387.

They are now also compatible with recent releases of text-generation-webui, KoboldCpp, llama-cpp-python, ctransformers, rustformers and most others. For compatibility with other tools and libraries, please check their documentation.
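
The two method lists above make it possible to tell from a file name which compatibility group a model file falls into. The sketch below assumes file names follow the `<model>.ggmlv<N>.<method>.bin` pattern seen in this dataset's title; the second file name in the example is hypothetical.

```python
import re

# Method names copied from the compatibility notes above.
ORIGINAL_METHODS = {"q4_0", "q4_1", "q5_0", "q5_1", "q8_0"}
K_QUANT_METHODS = {
    "q2_K", "q3_K_S", "q3_K_M", "q3_K_L",
    "q4_K_S", "q4_K_M", "q5_K_S", "q6_K",
}

# Assumed naming pattern: <model>.ggmlv<N>.<method>.bin
FILENAME_RE = re.compile(r"\.ggmlv(?P<version>\d+)\.(?P<method>q\d+_[0-9A-Za-z_]+)\.bin$")


def quant_family(filename: str) -> tuple:
    """Return (method, family) where family is 'original', 'k-quant' or 'unknown'."""
    match = FILENAME_RE.search(filename)
    if match is None:
        raise ValueError(f"unrecognised GGML file name: {filename}")
    method = match.group("method")
    if method in ORIGINAL_METHODS:
        return method, "original"
    if method in K_QUANT_METHODS:
        return method, "k-quant"
    return method, "unknown"


print(quant_family("llama-7b.ggmlv3.q4_1.bin"))
print(quant_family("llama-7b.ggmlv3.q4_K_M.bin"))  # hypothetical file name
```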