mosix 2482814c56 about unicode 2 years ago
api 3bd5344e4a add subscription to api client 2 years ago
irc 0a366e5b1f add irc client 2 years ago
.env.example cfcea73197 add example .env file 2 years ago
.gitignore 50345dbd08 remove .idea and ignore __pycache__ 2 years ago
README.md 2482814c56 about unicode 2 years ago
main.py a6864534ce main ist still testing shit for me 2 years ago
message.py bc5b62b036 hmac message verificytion 2 years ago
request_oauth_token.py f569eab3a9 script to request oauth token for irc client use 2 years ago

README.md

TwitchChatBot

Before jumping into links, read the text. This is still in development. Let's call it a playground.

First of all, Twitch has an API. The client can do some stuff, but not enough. The class used for this is TwitchApiClient(client_id, client_secret). See main.py, which I think should be __main__.py. Please check.

The idea is to find a channel and subscribe to message events. The API client only needs clientId and clientSecret. Then we need to send a message. This is where I started. Then I found out they use IRC chat.

Twitch is great, because they use IRC chat. For IRC chat we need an OAuth token. The IRC client should log in. That's all it can do for now. The IRC client needs a token with the right permissions (scopes).

TL;DR

Starting with:

  • See your scopes in .env (you may want to add chat:edit).
  • Use the request_oauth_token.py script.
  • Copy the URL and open it in a browser.
  • After the redirect to localhost, copy the token from the URL.

  • Start telnet with the command: telnet

  • Before connecting, you may want to enable local echo so you can see the characters you type.

  • Then connect:

    • TLS: open irc.chat.twitch.tv 6697 (telnet does not speak TLS, so you must be a genius to use this one via telnet)
    • plain text: open irc.chat.twitch.tv 6667 (use this one for telnet)
    • NOTE: you can also connect directly by passing host and port to telnet: telnet irc.chat.twitch.tv 6667
  • Now you have an open TCP connection.

Every character you type is sent the moment you type it and is decoded as UTF-8 on the server side (I write more about UTF-8 below). So when you hit the delete key, that character is transmitted too; it does not delete any text. The server does not do anything with it. Well, sometimes it answered with something like: hey you!!!!. Which is nice, because you see there is a connection. Thanks for responding to bullshit.

  • First we need to log in.
  • Use the script request_oauth_token.py to get a token for the IRC login.
  • First source the .env file: source .env
  • Call the script: chmod u+x request_oauth_token.py && python3 request_oauth_token.py
  • The first characters you transmit are the authentication.
  • Type: PASS oauth:<yourToken>
  • Then hit Enter to transmit the line ending (\n).
  • Next send your username by typing: NICK <username>
  • Send \n via Enter.
  • Next try joining a channel via JOIN #<channelname> (raw IRC takes no leading slash, and the channel name starts with #).
  • Watch the output.
  • Maybe try sending a chat message with PRIVMSG #<channelname> :<message>
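
The telnet steps above can be sketched in Python. This is only a minimal sketch, not code from this repo: the helper name build_login_lines is made up for illustration, and it assumes that Twitch wants CR LF terminated lines and a lowercase channel name prefixed with #.

```python
def build_login_lines(token: str, nick: str, channel: str) -> list[bytes]:
    """Return the raw IRC lines for PASS, NICK and JOIN, in that order.

    Hypothetical helper: the name and signature are not part of this repo.
    """
    lines = [
        f"PASS oauth:{token}",
        f"NICK {nick}",
        # Assumption: channel names are lowercase and start with '#'.
        f"JOIN #{channel.lstrip('#').lower()}",
    ]
    # IRC lines end with CR LF; a socket wants bytes, so encode as UTF-8.
    return [(line + "\r\n").encode("utf-8") for line in lines]


if __name__ == "__main__":
    for raw in build_login_lines("<yourToken>", "<username>", "<channelname>"):
        print(raw)
```

Each element of the returned list could be passed to sendall() on an open socket, in the same order you would type the lines in telnet.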

If you use the IRC client, you have TLS support: there is a wrapper for the TCP socket.
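
How such a wrapper can look with the standard library is sketched here. It is not the code of the IRC client in this repo; the function names make_tls_context and open_tls_connection are assumptions for illustration, and only the host and port come from the list above.

```python
import socket
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Create a client-side TLS context that verifies the server certificate."""
    return ssl.create_default_context()


def open_tls_connection(host: str = "irc.chat.twitch.tv", port: int = 6697) -> ssl.SSLSocket:
    """Open a TCP connection and wrap it in TLS (hypothetical helper).

    Nothing is connected on import; call this function to actually connect.
    """
    raw_sock = socket.create_connection((host, port), timeout=10)
    # server_hostname enables SNI and hostname verification.
    return make_tls_context().wrap_socket(raw_sock, server_hostname=host)
```

The wrapped socket is used like the plain one (sendall, recv), so the rest of a client does not need to know whether TLS is on.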

The next thing to be added to the code is sending a message. That should be PRIVMSG. Read up on the commands and on PRIVMSG and its result.
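
A possible shape for that next step, as a sketch only. Both helpers (format_privmsg, parse_privmsg) are hypothetical and not in the repo yet. The parser assumes plain lines of the form :nick!user@host PRIVMSG #channel :text, so it does not handle lines that start with IRCv3 tags (@...).

```python
from typing import Optional


def format_privmsg(channel: str, text: str) -> bytes:
    """Build the raw line that sends `text` to `channel` (hypothetical helper)."""
    name = channel.lstrip("#").lower()
    return f"PRIVMSG #{name} :{text}\r\n".encode("utf-8")


def parse_privmsg(line: str) -> Optional[tuple[str, str, str]]:
    """Return (nick, channel, text) for a PRIVMSG line, or None for anything else."""
    if not line.startswith(":"):
        return None
    prefix, _, rest = line[1:].partition(" ")
    command, _, params = rest.partition(" ")
    if command != "PRIVMSG":
        return None
    channel, _, text = params.partition(" :")
    nick = prefix.partition("!")[0]
    return nick, channel, text.rstrip("\r\n")
```

The formatter returns bytes ready for sendall(); the parser takes an already decoded line.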

Useful links:

About UTF8 and encoding

Let's start with some bytes we have received. They are just a sequence of 0s and 1s. An encoding defines which value maps to which letter. For example, 0100 0001 maps to the letter A in UTF-8. That is why you cannot use a bytes object as a string: you need to decode it. So you must know the encoding or have some crazy detection code.

UTF-8 is a multibyte encoding. A normal character (char) has the size of one byte (0000 0000, 8 bits). So the maximum number of combinations is 2^8 = 256. 0 is a number too (0000 0000 is also a combination), so the highest value is 255. But single-byte characters start with a 0 bit, which leaves only 2^7 combinations; continue reading. Anyway.

A normal character that uses only one byte always starts with a 0 bit. So 0101 1101 is a one-byte character (it is the closing bracket ]). The highest value for a one-byte char is therefore 2^7 - 1 = 127 (because the first bit is used to say it's a normal char). UTF-8 is also backward compatible with ASCII: every ASCII char maps to the same byte in UTF-8. ASCII bytes always start with 0.
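
A tiny Python illustration of the bytes versus string point. It only uses built-ins; the variable names are arbitrary.

```python
raw = bytes([0b0100_0001])      # one byte: 0100 0001
text = raw.decode("utf-8")      # interpret the byte as UTF-8

print(raw)    # b'A'
print(text)   # A

# Mixing bytes and str fails until one side is converted.
try:
    raw + "B"
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```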
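
The single-byte rule can be checked with a bit mask. A small sketch; is_single_byte is a name made up for this example.

```python
def is_single_byte(byte: int) -> bool:
    """True if the top bit is 0, i.e. the byte is a complete one-byte char."""
    return (byte & 0b1000_0000) == 0


sample = bytes([0b0101_1101])
print(is_single_byte(sample[0]))   # True
print(sample.decode("utf-8"))      # ]

# Every value from 0 to 127 decodes to the same char in ASCII and in UTF-8.
same = all(bytes([i]).decode("ascii") == bytes([i]).decode("utf-8") for i in range(128))
print(same)                        # True
```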

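Multibyte characters (remember 8 bits = 1 byte)

They simply use the first bits of each byte for something else. They say:

  • if it starts with 0: single-byte char
  • if it starts with 10 → a following (continuation) byte of a multibyte char, with the remaining 6 bits for the value
  • if it starts with 11 → the marker (first) byte of a multibyte char; the number of leading 1s (110, 1110 or 11110) is the total number of bytes, and the bits after the next 0 carry value

So the logic is:

  • read byte by byte
  • if 0 → normal
  • if 11 → multibyte start
    • if the next byte starts with 10, read its remaining 6 bits
    • if the next byte starts with 0 or 11, this is a new char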
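
The reading logic above, written down as a rough Python sketch. The names classify and split_characters are invented here, and the code assumes well-formed input: it does no validation, it only groups bytes by their leading bits.

```python
def classify(byte: int) -> str:
    """Classify a byte by its leading bits."""
    if byte >> 7 == 0b0:
        return "single"        # 0xxxxxxx
    if byte >> 6 == 0b10:
        return "continuation"  # 10xxxxxx
    return "lead"              # 11xxxxxx, start of a multibyte char


def split_characters(data: bytes) -> list[bytes]:
    """Group a byte string into one bytes object per character."""
    chars: list[bytes] = []
    for byte in data:
        if classify(byte) == "continuation" and chars:
            # Belongs to the character that is currently being read.
            chars[-1] += bytes([byte])
        else:
            # A 0... or 11... byte always starts a new character.
            chars.append(bytes([byte]))
    return chars


print(split_characters("a🙏\n".encode("utf-8")))
# [b'a', b'\xf0\x9f\x99\x8f', b'\n']
```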

Have an example:

  • mkdir /tmp/test && cd /tmp/test
  • echo "🙏" > example.txt
  • xxd example.txt

Output: 00000000: f09f 998f 0a .....
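
The same bytes can be produced without a file. A small sketch in Python (needs 3.8 or newer for the grouping arguments of bytes.hex); the trailing "\n" stands in for the newline that echo appends.

```python
data = "🙏\n".encode("utf-8")   # same content that echo wrote to example.txt
dump = data.hex(" ", -2)        # group two bytes per block, counted from the left
print(dump)                     # f09f 998f 0a
```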

So how do we read these bytes? This is a hex representation. Hex has base 16, but let's not discuss number systems. Hex uses the chars 0-9 and A-F. You can just continue counting.

  • 0 = 0 * 16^0 = 0
  • 9 = 9 * 16^0 = 9
  • A = 10 * 16^0 = 10
  • F = 15 * 16^0 = 15

With more digits:

  • 5F = (5 * 16^1) + (15 * 16^0) = 5*16 + 15*1 ...
  • Like: [16^4][16^3][16^2][16^1][16^0]
  • Same in binary: [2^4][2^3][2^2][2^1][2^0]
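
The positional rule from the list above as a short sketch. hex_to_int is a made-up name for illustration; Python's built-in int(text, 16) does the same job and is used here as a cross-check.

```python
HEX_DIGITS = "0123456789ABCDEF"


def hex_to_int(digits: str) -> int:
    """Convert a hex string by hand: shift one position left, then add the digit."""
    value = 0
    for ch in digits.upper():
        value = value * 16 + HEX_DIGITS.index(ch)
    return value


print(hex_to_int("5F"))   # 95
print(int("5F", 16))      # 95, built-in cross-check
```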

How to read this the easy way: gnome-calculator in programming mode. What I wanted to say is that a byte is represented as 2 hex chars, because one hex char covers 4 bits (F = 1111) and a byte has 8 bits, so we need two.

So the output in binary is: 1111 0000 1001 1111 1001 1001 1000 1111 0000 1010

Translates to:

  • 1111 0000 multibyte start
  • 1001 1111 multibyte read next
  • 1001 1001 multibyte read next
  • 1000 1111 multibyte read next
  • 0000 1010 new simple char
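
If no calculator is at hand, the hex to binary step can be done in Python. A minimal sketch; to_bits is a name chosen for this example.

```python
def to_bits(byte: int) -> str:
    """Format one byte as two groups of four bits."""
    bits = f"{byte:08b}"
    return f"{bits[:4]} {bits[4:]}"


for byte in bytes.fromhex("f09f998f0a"):
    print(f"{byte:02x} = {to_bits(byte)}")
# f0 = 1111 0000
# 9f = 1001 1111
# 99 = 1001 1001
# 8f = 1000 1111
# 0a = 0000 1010
```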

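We have two chars. The first one (X = marker bit, the other bits are the value):

  • 1111 0000 → XXXX X000
  • 1001 1111 → XX01 1111
  • 1001 1001 → XX01 1001
  • 1000 1111 → XX00 1111

The remaining bits are 000 011111 011001 001111 = 1F64F (hex), which is the code point U+1F64F (🙏).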

The second one is 0000 1010 = 10 (decimal) = LF (\n), the newline that echo appends. Use echo -n to leave it out.
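
The four-byte character can also be decoded by hand in Python, by masking off the marker bits and gluing the rest together. A sketch for exactly the four-byte case; decode_four_byte is a hypothetical name and it does not validate its input.

```python
def decode_four_byte(seq: bytes) -> int:
    """Combine the value bits of a 4-byte UTF-8 sequence into a code point."""
    b0, b1, b2, b3 = seq
    return (
        (b0 & 0b0000_0111) << 18    # 11110xxx: 3 value bits
        | (b1 & 0b0011_1111) << 12  # 10xxxxxx: 6 value bits
        | (b2 & 0b0011_1111) << 6
        | (b3 & 0b0011_1111)
    )


code_point = decode_four_byte(bytes.fromhex("f09f998f"))
print(hex(code_point))   # 0x1f64f
print(chr(code_point))   # 🙏
```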

Me thinking: they could have used the first byte too.

Unicode

The value is then looked up in the font in use. Fonts come in different formats, because the shape of a letter or emoji can be stored in different ways. The font needs to contain the character. But in general the encoding says how to interpret a value. For example, the character \r says: move the cursor back to the start of the line. It has no representation in a font.

So what is unicode:

The Unicode Standard is the universal character encoding standard for written characters and text.

And

Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The 8-bit, byte-oriented form, UTF-8, has been designed for ease of use with existing ASCII-based systems.
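
To see the three encoding forms side by side, here is a short sketch. encoded_sizes is a made-up helper; the big-endian codec names (utf-16-be, utf-32-be) are used so that Python does not prepend a byte order mark.

```python
def encoded_sizes(char: str) -> dict[str, int]:
    """Return the number of bytes `char` needs in each encoding form."""
    return {name: len(char.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


print(encoded_sizes("A"))    # {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
print(encoded_sizes("🙏"))   # {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}
print("🙏".encode("utf-32-be").hex())   # 0001f64f
```

The code point is the same in all three forms; only the bytes that carry it differ.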

I think an encoding that by definition uses several encodings (just search things in fonts) should not be called an encoding. More like:

Unicode defines shapes that represent a value encoded in UTF-8, UTF-16 or UTF-32. Those shapes are defined by the font in use.

References: