{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# šŸ“ˆ Stock Market BPE Tokenizer - Usage Examples\n", "\n", "This notebook demonstrates how to use the Stock Market BPE tokenizer.\n", "\n", "## šŸŽÆ What You'll Learn\n", "- How to load the trained tokenizer\n", "- How to encode stock data\n", "- How to decode tokens back to text\n", "- How to calculate compression ratios\n", "- Real-world examples with actual stock data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## šŸ“¦ Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tokenizer import StockBPE\n", "import json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## šŸ”§ Load the Trained Tokenizer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize and load the trained tokenizer\n", "tokenizer = StockBPE()\n", "tokenizer.load(\"stock_bpe\")\n", "\n", "print(f\"āœ… Tokenizer loaded!\")\n", "print(f\"šŸ“š Vocabulary size: {len(tokenizer.vocab):,}\")\n", "print(f\"šŸ”€ Number of merges: {len(tokenizer.merges):,}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## šŸ“Š Example 1: Encode a Single Stock Record" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sample stock data\n", "stock_data = \"AAPL|2024-01-15|150.25|152.30|149.80|151.50|1000000\"\n", "\n", "print(\"šŸ“ˆ Original Stock Data:\")\n", "print(stock_data)\n", "print(f\"\\nšŸ“ Length: {len(stock_data)} characters\")\n", "\n", "# Encode\n", "tokens = tokenizer.encode(stock_data)\n", "print(f\"\\nšŸ”¤ Encoded Tokens:\")\n", "print(tokens)\n", "print(f\"\\nšŸ“Š Token count: {len(tokens)}\")\n", "\n", "# Calculate compression\n", "original_bytes = len(stock_data.encode('utf-8'))\n", "compression_ratio = original_bytes / len(tokens)\n", "print(f\"\\nšŸ—œļø Compression ratio: {compression_ratio:.2f}x\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## šŸ”„ Example 2: Decode Tokens Back to Text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Decode the tokens back to original text\n", "decoded = tokenizer.decode(tokens)\n", "\n", "print(\"šŸ”“ Decoded Text:\")\n", "print(decoded)\n", "\n", "# Verify it matches the original\n", "print(f\"\\nāœ… Match: {stock_data == decoded}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## šŸ“Š Example 3: Multiple Stock Records" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Multiple stock records\n", "multi_stock = \"\"\"AAPL|2024-01-15|150.25|152.30|149.80|151.50|1000000\n", "MSFT|2024-01-15|380.50|385.20|379.00|384.75|850000\n", "GOOGL|2024-01-15|140.10|142.50|139.80|141.90|920000\"\"\"\n", "\n", "print(\"šŸ“ˆ Multiple Stock Records:\")\n", "print(multi_stock)\n", "\n", "# Encode\n", "tokens = tokenizer.encode(multi_stock)\n", "print(f\"\\nšŸ”¤ Total tokens: {len(tokens)}\")\n", "\n", "# Compression\n", "ratio = tokenizer.calculate_compression_ratio(multi_stock)\n", "print(f\"šŸ—œļø Compression ratio: {ratio:.2f}x\")\n", "\n", "# Decode and verify\n", "decoded = tokenizer.decode(tokens)\n", "print(f\"\\nāœ… Decoding successful: {multi_stock == decoded}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## šŸŽÆ Example 4: Analyze Compression Patterns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test different types of stock data\n", "test_cases = [\n", " (\"Single record\", \"AAPL|2024-01-15|150.25|152.30|149.80|151.50|1000000\"),\n", " (\"High price\", \"GOOGL|2024-01-15|2800.50|2850.20|2790.00|2845.75|500000\"),\n", " (\"Low price\", \"F|2024-01-15|12.50|12.80|12.30|12.75|5000000\"),\n", "]\n", "\n", "print(\"šŸ“Š Compression Analysis:\\n\")\n", "for name, data in test_cases:\n", " ratio = tokenizer.calculate_compression_ratio(data)\n", " tokens = len(tokenizer.encode(data))\n", " print(f\"{name:15} | Ratio: {ratio:.2f}x | Tokens: {tokens}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## šŸ” Example 5: Inspect Learned Patterns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show some learned merge patterns\n", "print(\"🧠 Sample Learned Patterns:\\n\")\n", "\n", "# Get first 10 merges\n", "for i, ((p0, p1), idx) in enumerate(list(tokenizer.merges.items())[:10]):\n", " try:\n", " pattern = tokenizer.vocab[p0].decode('utf-8', errors='ignore') + \\\n", " tokenizer.vocab[p1].decode('utf-8', errors='ignore')\n", " print(f\"Merge {i+1}: '{pattern}' -> Token {idx}\")\n", " except:\n", " print(f\"Merge {i+1}: Bytes ({p0}, {p1}) -> Token {idx}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## šŸ“ˆ Example 6: Real-World Usage Simulation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Simulate processing a day's worth of stock data\n", "daily_data = \"\"\"AAPL|2024-01-15|150.25|152.30|149.80|151.50|1000000\n", "AAPL|2024-01-16|151.60|153.20|151.00|152.80|1200000\n", "AAPL|2024-01-17|152.90|154.50|152.00|153.75|980000\n", "MSFT|2024-01-15|380.50|385.20|379.00|384.75|850000\n", "MSFT|2024-01-16|385.00|388.50|384.00|387.25|920000\n", "MSFT|2024-01-17|387.50|390.00|386.50|389.50|880000\"\"\"\n", "\n", "print(\"šŸ“Š Processing Daily Stock Data\\n\")\n", "print(f\"Original size: {len(daily_data)} characters\")\n", "\n", "# Encode\n", "tokens = tokenizer.encode(daily_data)\n", "print(f\"Tokenized: {len(tokens)} tokens\")\n", "\n", "# Calculate savings\n", "original_bytes = len(daily_data.encode('utf-8'))\n", "token_bytes = len(tokens) * 2 # Assuming 2 bytes per token\n", "savings = (1 - token_bytes / original_bytes) * 100\n", "\n", "print(f\"\\nšŸ’¾ Storage Savings:\")\n", "print(f\" Original: {original_bytes} bytes\")\n", "print(f\" Tokenized: {token_bytes} bytes\")\n", "print(f\" Savings: {savings:.1f}%\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## šŸŽ“ Summary\n", "\n", "### āœ… What We Learned\n", "- How to load and use the Stock Market BPE tokenizer\n", "- Encoding stock data into tokens\n", "- Decoding tokens back to original format\n", "- Calculating compression ratios\n", "- Analyzing learned patterns\n", "\n", "### šŸ“Š Key Metrics\n", "- **Vocabulary Size:** 5,500+ tokens\n", "- **Compression Ratio:** 3.5x average\n", "- **Accuracy:** 100% (lossless encoding/decoding)\n", "\n", "### šŸš€ Next Steps\n", "- Use this tokenizer in ML models for stock prediction\n", "- Compress large financial datasets\n", "- Analyze learned patterns for market insights\n", "\n", "---\n", "\n", "**Happy tokenizing! šŸ“ˆšŸ¤–**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" } }, "nbformat": 4, "nbformat_minor": 4 }