Domain: fileformat.info
Stories and comments across the archive that link to fileformat.info.
Comments · 113
-
Re: Too bad slashdot used to cause these
That's a prime character you've used
If you're going to be a pedant on the internet, best do your homework first.
-
Re: Too bad slashdot used to cause these
No, that character I used is the unicode apostrophe character.
Unicode prime is 0x2032.
I was going to paste in a unicode prime char alongside an apostrophe, but when I preview the post slashdot strips out the prime char.
What you've used in "Here’s" is the unicode right-single-quotation-mark char. https://www.fileformat.info/in...
Code x2019
I'm sorry but you're completely wrong.
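For anyone who wants to settle this kind of argument without relying on what survives a comment preview, here is a minimal sketch using Python's standard `unicodedata` module. The labels in the dictionary and the `describe` helper are illustrative names, not anything from the thread.

```python
import unicodedata

# The three look-alike characters argued about in this thread.
# The labels are informal descriptions, not official names.
candidates = {
    "typewriter apostrophe": "\u0027",
    "curly apostrophe": "\u2019",
    "prime": "\u2032",
}


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for label, ch in candidates.items():
    print(f"{label}: {describe(ch)}")
```

Working from escapes such as `\u2032` sidesteps the problem of a site stripping the character on input.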
-
Re: Too bad slashdot used to cause these
What's wrong with using a regular unicode apostrophe?
https://www.fileformat.info/in...
What unicode char is OS X using? If it was using apostrophe, it would be perfectly fine.
Here it is again: '
That's a prime character you've used (and that I've used in this sentence too)
The apostrophe character is when you have text substitutions turned on, or something like that. It uses the key on the keyboard which has the single and double quotes on it. The curly apostrophe (smart quotes or typographical quotes) is Opt + ] for the opening single quote and Shift + Opt + ] for the closing single quote, or curly apostrophe: ’
“Here’s the curly apostrophe used in a sentence enclosed in typographical quotes and an ellipsis at the end”
-
Re: Too bad slashdot used to cause these
What's wrong with using a regular unicode apostrophe?
https://www.fileformat.info/in...
What unicode char is OS X using? If it was using apostrophe, it would be perfectly fine.
Here it is again: '
-
Re:Good
Quotation mark code points that have been in Unicode for decades (since 1993) aren't "idiosyncratic".
-
Re:Why? Which features?
More like Microsoft decided to preserve existing OEM code pages for CMD.exe even though that meant that unicode characters outside those code pages won't display in a command prompt. It's a design decision.
Note that it's not like this for GUI applications - they all use UCS-2. Or UTF-16 for Windows 2000 or later.
https://msdn.microsoft.com/en-...
Which means, given a suitable font, your steaming pile of poop emoji U+1F4A9 should display fine in a Win32 GUI app on Windows 2000 or later.
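The UCS-2 versus UTF-16 distinction matters here because U+1F4A9 lies above U+FFFF, so it only fits in UTF-16 as a surrogate pair. A rough sketch follows, assuming `cp437` stands in for "an OEM code page" (the comment does not say which one) and with `fits_code_page` as a made-up helper name:

```python
poo = "\U0001F4A9"  # PILE OF POO

# UTF-16 represents code points above U+FFFF as a surrogate pair,
# which plain UCS-2 (one 16-bit unit per character) cannot do.
utf16 = poo.encode("utf-16-be")
print(utf16.hex())  # d83ddca9, i.e. the pair D83D DCA9


def fits_code_page(text: str, code_page: str = "cp437") -> bool:
    """Return True if every character of `text` exists in `code_page`."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


print(fits_code_page("\u00e9"))  # e-acute is in cp437
print(fits_code_page(poo))       # the emoji is not
```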
-
Re: It's needed to preserve the battery
On the contrary, it makes me think that every time something is working fine, somebody comes along to change it. Typewriter apostrophe has been around, well, since typewriters!
https://en.wikipedia.org/wiki/...
MS-WORD doesn't even use the same quotation marks for English and French because of those printing inspired people that say that a symbol looks nicer than another depending on the language, establish trends etc. when the used symbol adds no value at all and everybody understands what the symbol means anyway.
MS-Word had problems implementing that functionality first and many people still have problems, it goes from language analyzer to syntax validation software. Here are a few examples after a very quick search:
https://tedclancy.wordpress.co...
http://www.fileformat.info/inf...
https://www.quora.com/Punctuat...
http://snowball.tartarus.org/t...
-
Re:familyâ(TM)s
The reason it appears to not work is because of unicode abuse by commenters.
Yeah, I remember when Unicode worked, and the abuse that came along with it. If /. wants to filter out non-ASCII characters (or non Latin-1 characters), that's fine, but whatever it's currently doing is broken. There's no case where turning a curly quote into â(TM) is the correct thing to do.
It even seems like the code is trying to do something sensible, but just has a simple bug where it's using the wrong character encoding on its input. The Unicode character "RIGHT SINGLE QUOTATION MARK" is encoded as the bytes E2 80 99 in UTF-8. If you interpret those bytes as if they were Windows codepage 1252 characters, you get â, the Euro sign, then the Trademark symbol. Of those, only â is in Latin-1. It looks like Slashdot is trying to convert non-Latin-1 characters to a Latin-1 equivalent, or remove the character if there's no equivalent. So â makes it through, Euro sign is dropped, and the TM symbol gets turned into "(TM)", and you end up with the curly quote turning into "â(TM)". This is basically what GNU iconv does if you use the "//TRANSLIT" suffix on the destination encoding, except converting to iso-8859-1//TRANSLIT turns the Euro sign into "EUR".
The code just needs to interpret the input as being UTF-8 instead of CP1252, and it should work a lot better. But it's been broken for years, and nobody there wants to fix it.
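The diagnosis above can be reproduced in a few lines. This is only a sketch of the hypothesised behaviour: `to_latin1_guess` and its fallback table are invented for illustration and are not Slashdot's actual code.

```python
curly = "\u2019"                 # RIGHT SINGLE QUOTATION MARK
raw = curly.encode("utf-8")      # the bytes E2 80 99
misread = raw.decode("cp1252")   # wrong decoder: a-circumflex, euro sign, trade mark sign


def to_latin1_guess(text: str) -> str:
    """Keep Latin-1 characters, substitute known fallbacks, drop the rest.

    A guess at the filter described in the comment, not the real thing.
    """
    fallbacks = {"\u2122": "(TM)"}  # TRADE MARK SIGN
    out = []
    for ch in text:
        if ord(ch) <= 0xFF:
            out.append(ch)
        else:
            out.append(fallbacks.get(ch, ""))
    return "".join(out)


print(to_latin1_guess(misread))  # the familiar mojibake
print(raw.decode("utf-8"))       # correct decoder: the curly quote survives
```

Decoding the same bytes as UTF-8 gives back the original character, which is the one-line fix the comment asks for.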
-
U+5350
Insignia appropriated by National Socialism:
-
Re: The takeover has started!
-
or not ...
I call bullshit !
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
-
Re:Bet: Did robots or humans edit this article?
There's a difference between "supports unicode" and "allows all the fucked up shit". Being able to make a standard English character like http://www.fileformat.info/inf... should be supported (did you notice how we have to write "the" with a "th" instead of just using the thorn?), but probably not hippie bullshit like http://www.fileformat.info/inf... though.
-
Re:I'm surprised they actually pulled this off!
That's small potatoes when you could be a smiling poo.
-
Re:SRT Interpretation Comes Into Mainstream Questi
Slashcode stripped the unicode "superscript two" symbol from the atoms of the expressions.
Rewriting:
d(tau)^2 = dt^2 - dr^2/c^2 = invariant
dr^2 = dx^2 + dy^2 + dz^2
-
Well the muslims have their filthy prayer to satan
Well the muslims have their filthy prayer to satan in unicode. That'smuch worse than a condom
-
Re:U+1F926
I prefer U+1F595
-
Re:Official? Hah.
emoji that are specific to Japan that got through Unicode's standardization process (*cough*U+1F5FF*cough*)
Uhh... The Moai statues are from Easter Island. Easter Island is off the coast of Chile, and is approximately 8400 miles (13,500 km) from Japan.
Google Maps has your late-blooming education, right here. You can even right-click that pin mark at Easter Island and use the Measure Distance tool to find out exact distances to various parts of Japan.
-
Re:Unicode 8 support
Thanks for the link. I see that the other critical characters are Taco and Burrito. Slice of Pizza was lonely, maybe? That can't be it, because Hamburger also exists. There are even glyphs for chicken (a drumstick), ribs, and Ramen noodles - glyph says 'steaming bowl', but it's pretty obvious what that is hanging off of those chopsticks. Perhaps Mug of Beer was seeking variety? FYI, this site can give the glyph info and which fonts contain it, but it cannot actually render them yet.
And if one wanted to type a Wind Blowing Face, now's the time. Maybe that one's not new. It seems that one is related to a bunch of weather related icons, like fog, cloud with lightning bolt, and cloud with tornado. They seem to be adding lots of these Emoji - I thought there was a Unicode code point shortage? Maybe that's just because UTF-8 has to maintain backward compatibility with ASCII. From what I understand, in doing so, it wastes a few hundred other code pages.
-
Re:Improvements?
Fedora is free software, and Red Hat uses it to see what will get pulled into their Red Hat / Cent OS distros. Vendor lock in? What on earth vendor lock in is implied here?
Also, how dare you say a hot dog is a non essential character?
http://www.fileformat.info/inf...
-
Re:Emojis are for cows.
I am somewhat impressed that at least one moderator was apparently able to pick up on this joke without any hints. Although I suppose it's also possible that the joke wasn't really that funny.
I probably could have made it a little less esoteric by explicitly linking to U+1F404 and U+1F42E, but in all honesty, that didn't occur to me at the time.
-
What bug?
The character in question is Hiragana "No", codepoint U+306E. As far as I can tell, this has existed since Unicode 1.1 and there are no differences in the Unicode metadata when compared to any other Hiragana glyph. It is marked as IsAlphabetic=True, Category=Other Letter, and NumericType=None for example. So are all the other common Hiragana glyphs. If there is a bug, it's clearly with some specific application, and not Unicode or Unicode metadata. Compare http://www.fileformat.info/inf... with any other Hiragana glyph, like http://www.fileformat.info/inf... (Hiragana "Ha").
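The same comparison can be made locally against whatever Unicode database ships with your Python build. The sketch below assumes that `str.isalpha` and `unicodedata.numeric` are reasonable stand-ins for the IsAlphabetic and NumericType fields on the linked pages; `properties` is an illustrative helper name.

```python
import unicodedata


def properties(ch: str) -> dict:
    """Collect a few Unicode properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),       # "Lo" means Other Letter
        "is_alpha": ch.isalpha(),
        "numeric": unicodedata.numeric(ch, None),   # None when not numeric
    }


no = properties("\u306e")  # Hiragana "No"
ha = properties("\u306f")  # Hiragana "Ha"

# List every property on which the two characters differ.
differing = {key for key in no if no[key] != ha[key]}
print(differing)
```

If only the name differs, the metadata gives an application no reason to treat the one character specially.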
-
Re:Unicode is badly designed
They not only differ in shape (though to your eyes they *look the same*)
It's very clear that they don't differ in shape if you open Latin A and Greek Alpha in two different tabs and switch back and forth between them.
-
Re:Unicode is badly designed
I know, but read its comments section and cry.
-
Re:Lol
"No you are wrong."
Pretty sure I'm not. We could just claim that back and forth, but let's go over this:
Here's what you said:
"Go X bytes into the string. If that byte is a continuation byte, back up. Back up a maximum of 3 times. This will find a truncation point that will not introduce more errors into the string than are already there."
Here's what I said:
"This only works for UTF-8, and theoretically fails with the older type of UTF-8 (when you could have up to 6 bytes, by spec). So you probably will have to go through it character by character, not byte by byte, exactly as Brons said."
So pretend you have a 12 character display. Your method, for UTF-8:
> Checks to see if the input is 12 or less bytes, and displays it fine (this works)
> If not, it goes to that 12th byte, then checks it to see if it is a continuation byte (a byte which, when ANDed with 0xC0, is equal to 0x80)
> If it is a continuation byte, and we haven't seen three in a row yet, increment the number seen, and back up one byte.
> If we found a non-continuation byte or we have seen three continuation bytes in a row, then what we are looking at must be a starter byte.
> Write four bytes, beginning by overwriting the starter byte: 0xE2 0x80 0xA6 0x00 (ellipsis, null character)
With this method, you definitely could have left some garbage to the right of the null (if that null ate anything to the right of that), but that's ok because the null ends the stream (if it doesn't, you'll need to pad some more nulls). An alternate method that doesn't stamp the null is vastly worse, as if you were finding a two byte character to stamp the three byte ellipsis into, you would have eaten the first byte of the NEXT multibyte sequence, leaving you with an illegal data stream, and no null to tell the next guy to stop.
But, anyway, this one works- like I said- but I claimed that it had two problems- "only for UTF-8" and "results in a VERY short message for some inputs". It also trivially fails for the pre-RFC-3629 UTF-8 standard, but I guess we are ok with that (that version can have up to five continuation bytes).
If your message was, let's say, 8 of the "smiling face with smiling eyes" emojis:
http://www.fileformat.info/inf...
(or equivalent 4 byte characters)
The algorithm of "go 12 bytes in" will skip past the first two entire "0xF0 0x9F 0x98 0x8A" sequences, landing on the 0x8A of the third one. The algorithm will detect that this is a continuation byte, and back up the max times (through the 0x98, and 0x9F), landing on (and stamping over) the 0xF0 initial byte. But this means that your output message is:
(happy face)(happy face)(ellipsis)
You took a 12 character display AND LIMITED IT TO TWO CHARACTERS. When in fact, the original message would have fit, if you did what Brons said.
Because you searched in N bytes, instead of doing what Brons said (and that you even fucking called "MORONIC"), you fucked your hypothetical user AND insulted the guy with the right answer at the meeting (or were at least rude to him, brusque, or superior without cause).
But, let's continue.
I also claimed that this "only works for UTF-8". This is pretty trivially true- you explicitly refer to "continuation bytes", which are definitely not present in all encoding methods. UTF-16 is either one or two 16-bit words, and these are not "continuation bytes". With such an input, you would go 2*N forward, and then check for if the word sequence found was whichever surrogate comes first in your byte ordering (ex, you might be looking to see if it is a high surrogate, and therefore the start of a character, if your byte stream has that ordering), and if not, back up one word to find the guaranteed start of character, and then stamp over that with your ellipsis. This is the general equivalent of your UTF-8 solution, but you still dramatically shorten what your user can
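The two approaches argued over in this comment can be compared side by side. What follows is a loose sketch, not either poster's actual code: it leaves out the trailing null byte, the function names are made up, and the character-based version counts Python code points, which is only an approximation of display cells.

```python
ELLIPSIS = "\u2026"


def truncate_by_bytes(data: bytes, limit: int) -> bytes:
    """Go `limit` bytes in, back up over at most 3 continuation bytes,
    then replace everything from the starter byte on with an ellipsis."""
    if len(data) <= limit:
        return data
    pos = limit - 1  # index of the limit-th byte
    backed_up = 0
    # A UTF-8 continuation byte has the bit pattern 10xxxxxx.
    while pos > 0 and backed_up < 3 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
        backed_up += 1
    return data[:pos] + ELLIPSIS.encode("utf-8")


def truncate_by_chars(text: str, limit: int) -> str:
    """Walk the string character by character instead of byte by byte."""
    if len(text) <= limit:
        return text
    return text[: limit - 1] + ELLIPSIS


message = "\U0001F60A" * 8  # eight SMILING FACE WITH SMILING EYES, 32 bytes

by_bytes = truncate_by_bytes(message.encode("utf-8"), 12).decode("utf-8")
by_chars = truncate_by_chars(message, 12)

print(len(by_bytes))  # 3: two emoji plus the ellipsis
print(len(by_chars))  # 8: the whole message fits in 12 characters
```

The byte-based version never produces invalid UTF-8 here, which is the other poster's point; it simply spends a 12-character display on three characters.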
-
Re:WTF is going on in USA
I think he was trying to say "Oh Shit!"
-
Re:What's the UTF-8 encoding of THAT?
As a native Chinese speaker, I can assure you that even the simplest Chinese character ("一", meaning "one", http://www.fileformat.info/inf... ) cannot be found in known online md5 hash dictionaries. So if Chinese characters (or any non-combining Unicode characters) are allowed in password boxes, we asian guys can create very-easy-to-remember-but-very-hard-to-brute-force passwords since their entropy is bloody high compared to printable ASCII characters. And a friend of mine hacked his Chromium to allow Unicode characters to be input into password boxes.
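A back-of-the-envelope version of the entropy claim, as a sketch under two loud assumptions: the character meant is U+4E00, and each password character is picked uniformly at random from its alphabet, which real, memorable passwords are not. The alphabet sizes are 95 printable ASCII characters and the 20,992 code points of the U+4E00 to U+9FFF block.

```python
import hashlib
import math

one = "\u4e00"              # assumed to be the character the comment means
utf8 = one.encode("utf-8")  # three bytes: E4 B8 80

# The MD5 digest a hash dictionary would have to contain.
digest = hashlib.md5(utf8).hexdigest()
print(digest)

# Bits of entropy per character if chosen uniformly at random.
ascii_bits = math.log2(95)                   # printable ASCII
cjk_bits = math.log2(0x9FFF - 0x4E00 + 1)    # CJK Unified Ideographs block

print(f"{ascii_bits:.2f} bits per ASCII character")
print(f"{cjk_bits:.2f} bits per CJK character")
```

On these assumptions one ideograph is worth slightly more than two random ASCII characters; a password made of common words is worth far less than the uniform figure in either script.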
:-)
-
Re:What's the UTF-8 encoding of THAT?
Um. Wow. That is all.
Actually, no, in Unicode 7.0 there's even more.
-
Re:What's the UTF-8 encoding of THAT?
If by "that" you mean "a fecal sample", the Unicode encoding is U+1F4A9.
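Strictly speaking U+1F4A9 is the code point; the UTF-8 encoding the thread title asks about is a four-byte sequence. Here is a small sketch of how those bytes are derived, using an illustrative helper that only handles the four-byte range:

```python
def utf8_encode_supplementary(cp: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("this sketch only handles four-byte sequences")
    return bytes([
        0xF0 | (cp >> 18),            # 11110xxx: top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (cp & 0x3F),           # 10xxxxxx: low 6 bits
    ])


encoded = utf8_encode_supplementary(0x1F4A9)
print(encoded.hex(" "))  # f0 9f 92 a9
```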
-
Re:Next wave of phishing?
That kind of phishing already exists, even more sophisticated: a bug that a lot of software contains is not distinguishing between same looking characters in different alphabets. E.g. you can sign up on many forum/bbs platforms as Administrator if your leading A is cyrillic A instead of latin A. Both look the same but have different html entity codes and are different unicode characters, which is true for most vowels and many consonants (e.g. cyrillic B and latin B, C and C, E and E...). Or, for more fun, look at this (single) character which looks exactly as "lj".
Those of us with customers who use two alphabets constantly have known about this problem for a long time and we've seen phishing on all different kinds of platforms using this strategy.
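A crude way to catch the "Administrator" trick is to look at which scripts a name mixes. The sketch below guesses the script from the first word of each character's Unicode name, which is a heuristic rather than the real Script property, and `scripts_used` is a made-up helper. The "lj" example assumes the character meant is U+01C9, since the original link did not survive.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Rough script guess: first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


latin = "Administrator"
spoof = "\u0410dministrator"  # leading CYRILLIC CAPITAL LETTER A

print(latin == spoof)        # False, although they render alike
print(scripts_used(latin))   # one script
print(scripts_used(spoof))   # two scripts: worth flagging

# Normalization does not merge cross-script look-alikes...
print(unicodedata.normalize("NFKC", spoof) == latin)
# ...but it does unfold the single-character digraph into two letters.
print(unicodedata.normalize("NFKC", "\u01c9") == "lj")
```

A mixed-script check will also flag legitimate names from users who genuinely write in two alphabets, so it works better as a warning than as a hard rejection.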
IDN (internationalized domain names) solves this problem in domain names with policy: you can't register a domain which looks exactly like some other domain except for that change in character. Still though, you can register both casino.it and casinò.it and that's where the real phishing potential is. I think, at least most native English speakers, would probably be fooled easier by a domain such as paypal-customer-division.com than paypàl.com.
-
Re:Why emoji?
Steaming turd makes sense, but why is there a Moon viewing ceremony character? I can't think of any good reason for that.
-
Re:Why emoji?
What's the point of adding pictographic symbols to Unicode?
Hear, hear.
When I'm sorting text, it's important to know how individual symbols relate -- is A before or after $? -- but I don't want to need to give a flying fart whether A comes before or after winky-smile and whether that comes before or after steaming turd. (No, really, there is a steaming turd character.)
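For what it is worth, a plain code point sort already gives these symbols a defined order, whether or not anyone cares about it. A minimal sketch follows; note that this is raw code point order, which is an assumption about what "sorting" means here and is not the same as language-aware collation.

```python
# "A", WINKING FACE, "$", PILE OF POO
symbols = ["A", "\U0001F609", "$", "\U0001F4A9"]

# Python compares strings by code point, so this is a pure numeric sort.
ordered = sorted(symbols)

for ch in ordered:
    print(f"U+{ord(ch):04X}")
```

In this ordering "$" (U+0024) precedes "A" (U+0041), and both precede anything in the emoji blocks.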
-
huffman error correction was in fax machines
so this really doesn't seem like a 'breakthrough'. It's just a new application of existing technologies. http://www.fileformat.info/mir...
-
I'd like U+1F4A9 please
So, if unicode characters are now a legitimate part of website names, I'd like to register a new domain:
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm
Imagine all the fun I could have with it: microsoft.pile-of-poo, oracle.pile-of-poo, mostgovernmentrepresentatives.pile-of-poo and so on. It would make blogging so much more satisfying. Who wants to be a dot-com anymore? So 90s. Be poop instead!
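On the wire, a non-ASCII domain label travels as an ASCII form: Punycode with an "xn--" prefix. The sketch below uses Python's built-in `punycode` codec directly and skips the validation steps of full IDNA processing; whether any registry would accept such a label is a policy question that code cannot answer.

```python
label = "\U0001F4A9"  # PILE OF POO

# Punycode-encode the label and add the ACE prefix used by IDNs.
ace = "xn--" + label.encode("punycode").decode("ascii")
print(ace)

# Decoding the part after the prefix recovers the original label.
back = ace[4:].encode("ascii").decode("punycode")
print(back == label)
```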
-
Re:why dont we just use chinese characters?
Been there, done that. Look specifically at APL in the 60s. Functions were represented by single characters which you needed a special keyboard to type. For example, instead of typing the string floor, instead it was represented by what is now Unicode Character 'LEFT FLOOR' (U+230A) and required a special terminal to reproduce them. This limited where you could input and also display APL code.
One evolution of APL was the A+ language leading finally to K in the 90s. Having these special character requirements was too much of a pain in APL so all special characters were replaced by tuples of ASCII characters that were already common. In K, 'floor' was now expressed as _: which is no easier to guess the meaning of if you don't know the syntax, but now you need only standard ASCII to represent it.
'Son of K' was Q which comes full circle replacing _: with the keyword floor. Iverson's argument in developing APL was that the terseness achieved by using notation (single characters) meant that you could express concepts more concisely. This in turn meant that complex concepts were easier to visualise. There's a lot to be said for this, but I think Q now provides a much happier medium between the two perspectives.
-
Re:Toys(U+042F)us
You mean Toys U+1D19 Us. This isn't Soviet Russia yet!
The way Slashdot filters out most non-ASCII characters in posts is lame. It dates back to before they started using UTF-8 encoding and long since stopped making sense.
-
Toys(U+042F)us
Isn't it Toys (U+042F) us?
-
Re:Huh.
Both the tool and you make a (I assume wilfully) massively ignorant assumption: You assume ASCII or any similar 8 bit character set and matching keyboard.
Except for the US maybe, we all have Unicode with 110182(!) characters. Not 255. Let alone 127.
I just add a random true math symbol, double arrow or something in there (let alone a re-mapped skull and crossbones), and voila, it turns out your pseudo-argument is ignorant nonsense.
-
Re:Even a broken clock
You know, there is an actual character for the standard mathematical "not equals" sign. "=/=" is OK, but at least to me the visual similarity to the standard symbol is not obvious.