Cessen's Ramblings

2020 - 07 - 27

Foreign Language Dictionaries

In this post I'm going to make the case that foreign language dictionaries (e.g. Japanese-English dictionaries) are actually thesauruses, not dictionaries, and that this has important implications for how you use them when learning a language.

Native dictionaries vs foreign language dictionaries

When you look up a word in a native English dictionary, you typically get descriptive definitions. For example, the first two definitions of the Wiktionary entry for "bathe" are:

  1. To clean oneself by immersion in water or using water; to take a bath, have a bath.
  2. To immerse oneself, or part of the body, in water for pleasure or refreshment; to swim.

Foreign language dictionaries, on the other hand, typically provide a list of similar words or short phrases. For example, this is the entry for "やっぱり" from a popular Japanese-English dictionary:

  1. too; also; likewise; either​
  2. still; as before​
  3. even so; either way; after all; nonetheless; in any event; all the same​
  4. as expected​

This is much closer to the behavior of a thesaurus than it is to a dictionary. Thesauruses provide a list of words that are similar to the word you're looking up, and that's pretty much what's happening here.

I don't think this is necessarily a bad thing, but it's not what's advertised on the cover. They aren't called "foreign language thesauruses", after all. I'm also not saying that words should always be defined descriptively. On the contrary, there are many cases where providing a simple translation is absolutely the best approach. For example, defining "猫" as "cat" is far more succinct—and probably less confusing—than "a domesticated subspecies of feline animal, commonly kept as a house pet."

But the problem is that simple precise translations often don't exist. And in those cases you're left with a list of loosely related words that only provide a vague sense of the word's actual meaning. An extreme example of that is "やっぱり" above, which is nearly impossible to directly translate into English. Its Japanese-English entry is therefore a confusing mess of words and expressions that at best fail to convey its meaning and at worst are outright misleading. If you're trying to learn Japanese, that entry is close to useless. It really needs a proper descriptive definition.

However, that's not really a representative example. Most words aren't quite so difficult to translate, and the issues that manifest are usually more subtle. So let's look at a more typical example that actually has some close-ish English equivalents.

Here is the Japanese-English dictionary entry for "反省":

  1. reflection; reconsideration; introspection; meditation; contemplation​
  2. regret; repentance; remorse; being sorry

And here is a corresponding native Japanese dictionary entry for the same word:

  1. 自分の言動を振り返って考えること.
    (Rough translation: "Thinking back on one's words and actions.")

(Disclaimer: the translation is mine, so don't trust it too much. Also, this is just one dictionary, and dictionaries can vary. This is for example purposes only!)

One obvious difference is that the Japanese-English dictionary lists two meanings, whereas the native dictionary only lists one. As you might guess, the native dictionary is right: this word only has one meaning.

But the difference I want you to notice—and the one most relevant to this post—is, again, that the native dictionary gives a descriptive definition. Moreover, the Japanese-English dictionary fails to fully capture that definition in its list of words. It hints around it, and some of the words are pretty close. But none of them are quite the same. And it's in cases like this that you really want a descriptive definition, not a list of words that amount to a thesaurus entry.

This is just one example, of course. But this kind of thing happens all the time in Japanese-English dictionaries, and I suspect is common among foreign language dictionaries in general. And it happens because these dictionaries are trying to provide translations rather than definitions, even when precise translations don't exist.

Why does this matter?

In some Japanese language learning circles they talk about making a so-called "monolingual transition", where you switch from using a Japanese-English dictionary to using a native Japanese dictionary. The typical stated benefit of doing this is that from that point onward you are learning Japanese entirely in Japanese, and that this accelerates your progress. I think there is some truth to that: the more time you spend in a language, the faster your progress will be.

But I think there is a far more substantial benefit to using a native dictionary, which is simply that it's a real dictionary rather than a thesaurus. This gives you the enormous advantage of proper definitions, irrespective of what language they're written in. And especially for languages as different as Japanese and English, that will give you a much clearer picture of what words mean.

Of course, even native dictionaries don't provide a complete picture, which is easy to see in any native English dictionary. Things like the feeling of a word, the kinds of contexts it's usually used in, and its various connotations are rarely fully reflected even in native dictionaries. So no matter what, you still need to observe words being used in real contexts to build a complete understanding of them.

But dictionaries nevertheless provide a leg up on getting to that complete understanding faster. And the better the dictionary, the more of a leg up you get. And... thesauruses aren't very good dictionaries.

Advice for language learners

What if you're learning a language? What can you do? After all, especially when you're in the early stages, you can't reasonably use a native dictionary.

From my experience so far (which is admittedly limited—take this with a grain of salt), the most important thing is simply to be aware of this. Whenever you read a "definition" in a foreign language dictionary, keep in mind that it's actually a list of related words. The real meaning of the word is likely a bit different. As long as you're aware of that, you can learn the word's real meaning over time as you encounter it in real usage. And the thesaurus-style definition still gives you a leg up, because you at least have a starting point.

The second thing you can do is to start using a native dictionary as early as possible. That doesn't mean using it exclusively. Just get in the habit of looking up words in both a foreign language dictionary and a native dictionary. If you can't understand the native entry, no worries! But if you can, it will provide a much clearer picture of the word's real meaning. And over time, as your ability in the language grows, you can rely more and more on native dictionaries.

Lastly, I think there's a useful mental shift you can make when learning foreign words: rather than thinking you're learning e.g. the Japanese version of an English word, instead assume that you're learning an entirely new word altogether. Just as if you were learning a new word in your native language. There will certainly be times—especially with basic words—when, in fact, there is an essentially identical word in your native language. But it's nevertheless a useful frame of mind to have, to assume that most words you encounter are entirely new. And especially with languages as different as Japanese and English, that assumption will very often turn out to be right.

2020 - 04 - 03

Ropey: Things I Would Do Differently

Ropey is a utf8 text rope library for Rust that I wrote and maintain. Initially I created it just for use in my own text editor Led, but early on I decided to split it out into a separate library since it seemed generally useful. Last year I finally released Ropey 1.0, and since then I've had time to reflect on some of the design decisions I made, and what I do and don't like about Ropey.

To be clear, I don't expect to create a Ropey 2.0. At least, not any time soon. And even if I do, I will still maintain the 1.x release indefinitely. I strongly believe that stability is important for published code that you want others to use, and rapid-fire releasing new major versions with breaking changes is a great way to undermine that.

With that said, here are the things I would go back and change if I could.

Unicode scalar values are the primary indexing metric.

When I started Ropey, I had this grand plan to make a Unicode text rope that basically hides the underlying encoding, and just lets you deal with scalar values. Thus, all of Ropey's primary APIs, including its slicing APIs, operate in terms of Unicode scalar values rather than bytes.

I now believe that was the wrong decision.

First of all, it turns out that it's just not that useful in practice. It seems like a nice idea at first, because hey, Unicode scalar values are the atomic unit of Unicode text, right? But the reality is that because of e.g. graphemes, there isn't much of a practical difference between bytes and scalar values for any kind of real thing you might want to do. If you want to report character counts to the user, for example, you'll want to use some concept of graphemes, not scalar values. And if you want to insert text at a given location, any reasonably efficient indexing metric will do.

Second, it costs performance and space to keep track of scalar values. Every time you insert text, for example, Ropey has to scan the inserted text to count its scalar values. The same goes for removing text. And also internally when Ropey splits leaf nodes. Etc. And you have to store those scalar value counts in the nodes. And it's all just a bunch more code to maintain in Ropey for relatively little payoff.

Third, sometimes you need to operate in terms of bytes. There are a lot of situations where you can't reasonably be encoding agnostic. So Ropey needs to provide byte-oriented API's anyway. Thus, by using scalar values, all I really did was increase Ropey's API surface area without much practical application.

Long story short: if I were to do it over again, I would remove all the scalar value oriented APIs (the Chars iterator notwithstanding).

Having said all that, I don't think this was a huge blunder or anything. And it does make Ropey easier to jump into for casual use. But I don't think it pulls its weight.

Ropey includes split and concatenate operations.

One of the things that ropes are really good at is split and concatenate operations. These are even listed as fundamental operations on e.g the Wikipedia page about ropes. Thus, I implemented them for Ropey.

However, much like scalar value indexing, in practice these operations are pretty much useless for the kind of work Ropey is designed for. In a text editor, you rarely split a text buffer into two pieces, or merge two buffers together. And even in the rare cases where you need to do that, it's never done as a singular operation, so you can't really use these operations for it anyway.

And the down side of including these operations in Ropey is twofold:

  1. Again, increased API surface area.
  2. The code for these operations is surprisingly complicated because of the B-tree rebalancing that has to happen.

For both of those reasons, I wish I just hadn't included these operations in Ropey to begin with. It makes the API messier, and it's a bunch of gnarly code that I have to maintain now for no good reason.

The only up-side to having this code in Ropey is that it allows very large text insertions/removals to be done in O(M + log N) time (where M is the size of the text being inserted/removed) rather than O(M log N), because I re-use the split/concatenate code to special-case that. But small insertions/removals are the typical case, and very large insertions would still be reasonably fast (the constant factor on M is pretty low in both cases).

In short, I also don't think this code is pulling its weight.

Leaf nodes and internal nodes are the same size.

This is actually one that can be changed in a completely backwards compatible way, so I may yet change this in a future 1.x release. But doing so will be a pain, and I wish I had done this differently from the start.

The current design of Ropey requires that internal nodes and leaf nodes occupy roughly the same number of bytes as each other. I don't think this was a horrible decision, but it did box me in a bit, and I believe Ropey could be made even faster if this requirement were relaxed.

Specifically, if I want to play with the size of the leaf nodes to see how that affects performance, it necessarily also changes the internal node size. And vice-versa. And I have a hunch that the optimal sizes for each are different. However, even if they turn out to be the same, having the flexibility to make them different wouldn't cause any problems, and it makes experimentation easier down the line.

Final Thoughts

My general feeling about good code and good API design is that it's as much about what you don't include as it is about what you do include. And that feeling is certainly reflected in this post: none of these items are things I wish I had added, and two of them are things I wish I hadn't.

But despite what I've written above, I'm very happy with Ropey as a library over-all. Is it perfect? Nope. But I do think it's solidly good, and I believe it's a great choice for anyone building a text editor.

2019 - 01 - 09

Rust Community Norms for Unsafe Code

I recently released Ropey 1.0, a text rope library for Rust. Ropey uses unsafe code internally, and its use of unsafe unsurprisingly came up in the 1.0 release thread on Reddit.

The ensuing discussion (especially thanks to Shnatsel) helped me significantly reduce the amount of unsafe code in Ropey with minimal (though not non-existent) performance degradation. But the whole thing nevertheless got me thinking about unsafe code and community norms around it, and I figured writing some of those thoughts down might be useful.

My hope is this post will be part of a community-wide discussion, and I would love for others (likely smarter than me) to write their thoughts on this topic as well. This post is simply my take on it.

Different Use-cases

I've gotten the impression that, on-the-whole, the Rust community's attitude towards unsafe code is something along the lines of, "Don't ever use it unless you're a wizard... and you're not a wizard." Unsafe code, it seems to me, is seen as this incredibly dangerous thing, and if it's in your crate or application then there's something wrong.

(Having said that, I'm not at all sure whether this impression is actually accurate, but I imagine if I've gotten that impression, many others likely have as well.)

In any case, I absolutely don't see unsafe code that way, and I think this is because my use-cases for Rust are different.

As I understand it, Rust more-or-less began its life at Mozilla with the goal of making browsers more secure. From that standpoint I largely agree with avoiding unsafe code. We've seen time and time again that memory safety bugs are both easy to introduce and a significant attack vector. Eliminating that whole class of bugs is good for security.

However, for my own projects I'm mostly uninterested in the web. I suspect this puts me in a minority, both within the Rust community and within the software development community as a whole. The vast majority of software developed today seems to be web-facing in some way or another. However, the software I'm most interested in writing runs locally on people's computers and is not web-facing. For example, my 3d renderer.

Now, I'm not saying there aren't security concerns with local software. Local software has a way of eventually finding itself on networks (for example, render farms in the case of 3d rendering). Nevertheless, the cost-benefit analysis is different: trading some risk of unsafety for better performance is almost always the right choice for something like an offline 3d renderer.

So for me Rust's memory safety guarantees aren't about security, they're about making software development easier. Tracking down memory safety bugs is a pain—in my experience, far more of a pain than fighting the borrow checker. But if I need to drop into unsafe code in a few places to squeeze more performance out of my renderer, I'm certainly going to do it.

Of course, there are a lot of use-cases where that is absolutely the wrong trade-off, and I suspect that includes most software being written today. But I think it's worth making this distinction: different software has (legitimately) different priorities. And I would like to see more acknowledgment of that in the Rust ecosystem.

Which leads me to the next part of this post...

Marking Crates

Ropey has a prominent notice in its readme about its use of unsafe code. I put it there precisely because of varying use-cases: people who are writing software where security is a high priority probably shouldn't use Ropey, or should only utilize it in a compartmentalized way. On the other hand, people who are writing high performance text editors that can handle crazy-huge files without breaking a sweat...? That's precisely what I designed Ropey for.

Ropey's readme notice—or something like it—is something I would like to see adopted as a norm in the Rust community. Different software has different needs, and I don't think a one-size-fits-all attitude towards unsafe code makes sense. But I do think we all need to be forthright about the priorities of our crates. In fact, that forthrightness probably shouldn't be limited to just the use of unsafe code.

So I guess what I'm saying is: I don't know exactly what I want this to look like, but I would like there to be some generally accepted way for crates to advertise what their priorities are. This would allow people to better choose crates with trade-offs and cost/benefit analysis appropriate to their own projects.

Safety Feature Flag

At the most recent Seattle Rust Meetup I had a great conversation about unsafety with Ivan (don't know his last name, alas, but thanks Ivan!) and he had an excellent idea for Ropey that I think may be more widely applicable: provide a feature flag that switches to an all-safe version of Ropey. In Ropey's case this is pretty straightforward: the unsafe code is well compartmentalized, and could be easily swapped out with safe (but slower / more memory hungry) equivalents based on a feature flag.

I am seriously considering this for Ropey, but I think it can apply to other crates as well. And maybe having a standard flag for this would be good. It obviously wouldn't be useful for every crate, but it would allow crate consumers to make this trade-off themselves whenever a crate does provide the flag.

But Can't We Just Make Safe Code Faster?

One of the things I legitimately love about the Rust ecosystem is the push to get unsafe code compartmentalized into crates with safe APIs. Having a rich set of crates that are working hard to make unsafe code correct, and expose it safely for others, is super awesome.

However, this will never be able to cover all possible situations. Covering reasonably common cases is feasible, but eventually it becomes a game of whac-a-mole. There are always going to be things that won't be covered in the ecosystem, especially the more niche you get.

Moreover, some optimizations are specific to the needs of a holistic design. (John Carmack's talk about systems engineering explains this much better than I can, and is also just a great talk.) And sometimes these sorts of optimizations will require unsafe code, and because of their specificity won't be generally useful as a separate crate.

For these and other reasons, I believe there will always be legitimate cases for using unsafe code in one-off ways. The more we can shrink the number of such cases, the better, for sure. But it will never shrink to zero.

Wrapping Up

I don't mean any of this to be an "excuse" for people using unsafe code haphazardly. Even in my own case, I was able to significantly reduce (though not eliminate) the unsafe code in Ropey with a little prodding from Shnatsel on Reddit. When you're in a performance mindset it's easy to reach for the lowest level before you strictly need to.

But legitimate uses of unsafe code are inevitable, and I think a "know and broadcast your use-case" ethos would be healthier and more useful for the Rust community than a more-or-less "don't use unsafe code" attitude.

2018 - 12 - 12

Rust 2019 - It's the Little Things

(This post is in response to this call for posts on the Rust blog. Also, this post focuses on Rust as a language, because I feel like the crates ecosystem is something that will naturally grow and mature over time as more people use Rust for more use-cases. Also, in most respects I agree strongly with Jonathan Turner's Rust 2019 post, which has a similar flavor to this one.)

This might be an uncommon opinion—especially among those motivated enough to write a Rust 2019 post—but I actually think Rust is pretty much at a good place now. For the kinds of things that I want to do (e.g. my path tracer), there isn't much that Rust is lacking as a language. There are some fiddly things like "placement new" that could be useful, but nothing really major. And of course, well-designed new features are always welcome, they just don't seem particularly critical to me at this point.

In other words, I'm pretty much satisfied. Mission accomplished, as far as I'm concerned. I think the rest is just polish. Just the little things.

But... the little things aren't necessarily little in terms of effort and hours needed to accomplish them. I don't think any of the following is trivial. I mean "little" with respect to shininess and visibility/marketing optics. Also, little != unimportant here.

So without further delay, these are the things I would like to see Rust focus on in 2019.

Type-level Integers (a.k.a. Integers as Generic Arguments)

This paper cut is pretty easy to run into. I was at the Seattle Rust Meetup this month, and while helping out a new rustacian we ended up stumbling over the list of impls for fixed-size arrays in the standard library. It's... silly. Tons of impls that are identical except for varying the length of the array. I felt like I had to apologize to the person I was helping, "Yeah... yeah, it's super stupid. Yeah, I'm sorry. Yeah, you're totally right. There are plans to improve the situation. No, I'm not sure when, or how far along it is. But you can work around it this way."

I realize that allowing values of (almost) any type as generic arguments is a seductive and appealing idea. But from my perspective, Rust already has type-level integer arguments, in the form of fixed-size arrays. It's just that users don't have access to them for their own types, nor for impls of their own traits over arrays. It's a lot like how Go has built-in generic maps, etc. but you can't make your own generic types in it. There's this sort of implicit acknowledgement in the language design that "this is useful to be able to do", but at the same time it doesn't let you do it yourself.

So I think that just focusing on pushing type-level integers across the finish line should be a goal for 2019, without all the complications of more general type-level values. It's a long-standing paper cut that feels like a missing feature, rather than a wish-list new feature. It makes Rust feel inconsistent, and has concrete, negative, practical impacts even in the standard library itself as per the link above.

Expanding Supported Architectures

This isn't directly interesting to me—I'm pretty happy in x86/x64 land. But moving more architectures to the higher support tiers seems like it would be good. Rust is a "systems" programming language, and even though people don't seem to agree what "systems programming" means any more than they agree on what "art" means, I nevertheless think having a rich set of well-supported architectures is worth aiming for in a language like Rust. It would be nice if people didn't have to reach for C or other memory unsafe languages just because they're targeting a different chipset.

Compile Times

This is a boring topic to bring up, but Rust's compile times still leave a lot to be desired. I am really pleased with the progress already made, and in no way mean to belittle that or the efforts that have led to that progress. It's more to say: more please! I think this is actually really important. Performance matters. In fact, caring about performance is one of the reasons for using Rust, right? So I think prioritizing performance and other not-so-shiny things (as opposed to e.g. new features) for 2019 might be worth considering.

In Short...

This is all to say: I think Rust is great, and I'm actually really happy with where it's at now. I realize that my use-cases are not the same as everyone else's (e.g. I don't personally have much use for async/await). But there are a lot of little things that I would like to see some focus put on. That is my hope for 2019: the year of polish. Not shiny new features, not even ergonomics improvements per se as in the previous cycle. But polish on technical items.

Rust 2021 Edition

As for the possible future Rust 2021 Edition... well, this may also be an uncommon opinion, but I hope we don't have one. Frankly, I hope we don't ever have another edition of Rust at all. I hope 2018 is the first and last one.

That may seem harsh, but let me clarify: Rust 2018 was absolutely a good thing, and I'm glad that we did it. But it was also a bad thing, in that it makes the world of Rust more confusing.

This was highlighted for me at the recent Seattle Rust Meetup, where a new user couldn't get his code to compile. I also struggled to figure out why while helping him. Eventually someone else came by and asked, "Is this in Rust 2018?" We checked, and indeed the new project was Rust 2018, but his code was written in Rust 2015 style in a way that caused an error in 2018 (I don't recall how anymore).

Rust 2018 is backwards compatible in the sense that the compiler can still build 2015 code, and even use both 2015 and 2018 crates together. This is an impressive feat, and a great way to move forward with breaking changes. But technological compatibility is not the only important kind of compatibility.

All of the articles and tutorials on Rust, all of the Stack Overflow questions and answers, and really anything written about Rust at all is now not only out of date but incompatible with the user's first out-of-the-box experience. We went through this once already with Rust 1.0. It's not a good thing. It's quite bad, actually. And I would really, really like to avoid doing it again.

So what I really meant above is not that we shouldn't ever have another edition, but rather that we should try not to. If there are good reasons and things we really want to improve, sure, let's go for it. But let's absolutely not take editions as a given thing, and certainly not as a regular thing.

Let's make new editions only when justified--when the benefits outweigh the drawbacks. Not as part of a cadence. And let's try to make them as infrequent as possible. If we can go ten years without a new edition, I think that should be viewed as a good thing. Editions are a tool we can reach for when needed, but they are a tool with a cost.